5.2 Classification Functions

5.2.1 DecisionTreeClassifier (Decision Tree)

Decision Trees (DTs) build a model that learns simple decision rules from the data features in order to predict the value of a target variable.

class sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort=False)

Parameters:

criterion : string, optional (default="gini").      # The function to measure the quality of a split. "gini" uses the Gini
        impurity, "entropy" uses the information gain.

splitter : string, optional (default="best")       # The strategy used to choose the split at each node. "best" chooses
        the best split, "random" chooses the best random split.

max_depth : int or None, optional (default=None).     # The maximum depth of the tree. If None, nodes are expanded until
            all leaves are pure or until all leaves contain fewer than min_samples_split samples.

min_samples_split : int, float, optional (default=2)   # The minimum number of samples required to split an internal node:
                If int, then consider min_samples_split as the minimum number.
                If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples)
                is the minimum number of samples for each split.

min_samples_leaf : int, float, optional (default=1)    # The minimum number of samples required to be at a leaf node:
                If int, then consider min_samples_leaf as the minimum number.
                If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) is
                the minimum number of samples for each node.

min_weight_fraction_leaf : float, optional (default=0.)  # The minimum weighted fraction of the sum total of weights (of all
                the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

max_features : int, float, string or None, optional (default=None).   # The number of features to consider when looking for the best split:
                If int, then consider max_features features at each split.
                If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
                If "auto", then max_features=sqrt(n_features).
                If "sqrt", then max_features=sqrt(n_features).
                If "log2", then max_features=log2(n_features).
                If None, then max_features=n_features.

random_state : int, RandomState instance or None, optional (default=None).  # Controls the randomness of the estimator:
                If int, random_state is the seed used by the random number generator;
                If RandomState instance, random_state is the random number generator;
                If None, the random number generator is the RandomState instance used by np.random.

max_leaf_nodes : int or None, optional (default=None).            # Grow a tree with max_leaf_nodes in best-first fashion.
            Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.

min_impurity_decrease : float, optional (default=0.).          # A node will be split if this split induces a decrease of the impurity greater than or equal to min_impurity_decrease.

min_impurity_split : float,      # Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold; otherwise, it is a leaf.

class_weight : dict, list of dicts, "balanced" or None, default=None.     # Weights associated with classes in the form
    {class_label: weight}. If not given, all classes are supposed to have weight one. For multi-output problems, a list of dicts can
    be provided in the same order as the columns of y.

presort : bool, optional (default=False).   # Whether to presort the data to speed up the finding of best splits in fitting.

Attributes:

classes_ : array of shape = [n_classes] or a list of such arrays.   # The class labels (single-output problem),
        or a list of arrays of class labels (multi-output problem).
feature_importances_ : array of shape = [n_features].           # Feature importances; the higher the value, the more important
               the feature, i.e. the faster splitting on that feature reduces impurity.
max_features_ : int,                                 # The inferred value of max_features.
n_classes_ : int or list                               # The number of classes (for single output problems),
        or a list containing the number of classes for each output (for multi-output problems).
n_features_ : int                                   # The number of features when fit is performed.
n_outputs_ : int                                    # The number of outputs when fit is performed.
tree_ : Tree object                                  # The underlying Tree object after fitting.

Methods:

apply(X[, check_input])                  # Returns the index of the leaf that each sample is predicted as.
decision_path(X[, check_input])             # Return the decision path in the tree.
fit(X, y[, sample_weight, check_input, …])     # Build a decision tree classifier from the training set (X, y).
get_params([deep])                      # Get parameters for this estimator.
predict(X[, check_input])                  # Predict class or regression value for X.
predict_log_proba(X)                     # Predict class log-probabilities of the input samples X.
predict_proba(X[, check_input])             # Predict class probabilities of the input samples X.
score(X, y[, sample_weight])                # Returns the mean accuracy on the given test data and labels.
set_params(**params)                     # Set the parameters of this estimator.
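
A minimal usage sketch (the iris dataset bundled with scikit-learn is assumed here purely for illustration): fit a depth-limited tree, then read back a few of the attributes and methods listed above.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative data: the iris dataset shipped with scikit-learn.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)
clf.fit(X_train, y_train)

print(clf.classes_)               # class labels
print(clf.feature_importances_)   # impurity-based feature importances
print(clf.score(X_test, y_test))  # mean accuracy on held-out data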

5.2.2 AdaBoostClassifier (Sequential Ensemble: AdaBoost)

The core idea of AdaBoost is to train a sequence of weak learners (a weak learner is a model only slightly better than random guessing, such as a small decision tree) on repeatedly re-weighted versions of the data; the predictions of these weak learners are then combined by weighted voting (or a weighted sum) to produce the final prediction.

class sklearn.ensemble.AdaBoostClassifier(base_estimator=None, n_estimators=50, learning_rate=1.0, algorithm='SAMME.R', random_state=None)

Parameters:

base_estimator : object, optional (default=DecisionTreeClassifier).      # The base estimator from which the boosted ensemble is built.

n_estimators : integer, optional (default=50).                    # The maximum number of estimators at which boosting is terminated.
            In case of perfect fit, the learning procedure is stopped early.

learning_rate : float, optional (default=1.).                    # Learning rate shrinks the contribution of each classifier by learning_rate.
            There is a trade-off between learning_rate and n_estimators.

algorithm : {'SAMME', 'SAMME.R'}, optional (default='SAMME.R').    # If 'SAMME.R' then use the SAMME.R real boosting
        algorithm; base_estimator must support calculation of class probabilities. If 'SAMME' then use the SAMME discrete boosting
        algorithm. The SAMME.R algorithm typically converges faster than SAMME, achieving a lower test error with fewer
        boosting iterations.

random_state : int, RandomState instance or None, optional (default=None).
        If int, random_state is the seed used by the random number generator;
        If RandomState instance, random_state is the random number generator;
        If None, the random number generator is the RandomState instance used by np.random.

Attributes:

estimators_ : list of classifiers                #  The collection of fitted base estimators.
classes_ : array of shape = [n_classes]            # The class labels.
n_classes_ : int                           # The number of classes.
estimator_weights_ : array of floats              # Weights for each estimator in the boosted ensemble.
estimator_errors_ : array of floats               # Classification error for each estimator in the boosted ensemble.
feature_importances_ : array of shape = [n_features]    # The feature importances, if supported by the base_estimator.

Methods:

decision_function(X)                      # Compute the decision function of X.
fit(X, y[, sample_weight])                 # Build a boosted classifier from the training set (X, y).
get_params([deep])                       # Get parameters for this estimator.
predict(X)                            # Predict classes for X.
predict_log_proba(X)                      # Predict class log-probabilities for X.
predict_proba(X)                         # Predict class probabilities for X.
score(X, y[, sample_weight])                 # Returns the mean accuracy on the given test data and labels.
set_params(**params)                      # Set the parameters of this estimator.
staged_decision_function(X)                # Compute decision function of X for each boosting iteration.
staged_predict(X)                        # Return staged predictions for X.
staged_predict_proba(X)                   # Predict class probabilities for X at each boosting iteration.
staged_score(X, y[, sample_weight])           # Return staged scores for X, y.
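
A minimal sketch of the trade-off between base_estimator, n_estimators, and learning_rate (the breast-cancer dataset bundled with scikit-learn is assumed for illustration): boost depth-1 decision stumps and track the staged test accuracy as iterations accumulate.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative data: the breast-cancer dataset shipped with scikit-learn.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = AdaBoostClassifier(
    base_estimator=DecisionTreeClassifier(max_depth=1),  # decision stump
    n_estimators=100, learning_rate=0.5, random_state=0)
clf.fit(X_train, y_train)

# staged_score yields one accuracy value per boosting iteration.
for i, acc in enumerate(clf.staged_score(X_test, y_test), start=1):
    if i % 25 == 0:
        print(i, acc)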

5.2.3 RandomForestClassifier (Parallel Ensemble: Random Forest)

A random forest applies the idea of ensemble learning to combine many trees into one algorithm (a parallel ensemble); its basic unit is the decision tree. Each decision tree is a classifier, so for one input sample, N trees yield N classification results. The random forest aggregates all of these votes and outputs the class that receives the most votes as the final prediction, which is the simplest form of the Bagging idea.

class sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False, class_weight=None)

Parameters:

n_estimators : integer, optional (default=10).     # The number of trees in the forest.
criterion : string, optional (default="gini")     # The function to measure the quality of a split.
max_features : int, float, string or None, optional (default="auto")  # Same meaning as in DecisionTreeClassifier, as are the parameters below through min_impurity_decrease.
max_depth : integer or None, optional (default=None)
min_samples_split : int, float, optional (default=2)
min_samples_leaf : int, float, optional (default=1)
min_weight_fraction_leaf : float, optional (default=0.)
max_leaf_nodes : int or None, optional (default=None)
min_impurity_split : float,
min_impurity_decrease : float, optional (default=0.)
bootstrap : boolean, optional (default=True)      # Whether bootstrap samples are used when building trees.
oob_score : bool (default=False)              # Whether to use out-of-bag samples to estimate the generalization accuracy.
n_jobs : integer, optional (default=1)         # The number of jobs to run in parallel for both fit and predict. If -1, the number of jobs is set to the number of cores.
random_state : int, RandomState instance or None, optional (default=None)
verbose : int, optional (default=0)           # Controls the verbosity of the tree-building process.
warm_start : bool, optional (default=False)      # When set to True, reuse the solution of the previous call to fit and add more
estimators to the ensemble; otherwise, fit a whole new forest.
class_weight : dict, list of dicts, "balanced", "balanced_subsample" or None, optional (default=None)    # Weights associated with classes, as in DecisionTreeClassifier; "balanced_subsample" computes the balanced weights on the bootstrap sample of each tree.

Attributes:

estimators_ : list of DecisionTreeClassifier    # The collection of fitted tree objects in the forest.
classes_ : array of shape = [n_classes] or a list of such arrays    # The class labels (single-output problem), or a list of arrays of class labels (multi-output problem).
n_classes_ : int or list                   # The number of classes, or a list of the number of classes per output.
n_features_ : int                      # The number of features when fit is performed.
n_outputs_ : int                      # The number of outputs when fit is performed.
feature_importances_ : array of shape = [n_features]    # The feature importances (the higher, the more important the feature).
oob_score_ : float                     # Score of the training dataset obtained using an out-of-bag estimate.
oob_decision_function_ : array of shape = [n_samples, n_classes]          # Decision function computed with out-of-bag estimates on the training set.

Methods:

apply(X)                         # Apply the trees in the forest to X; return leaf indices.
decision_path(X)                    # Return the decision path in the forest.
fit(X, y[, sample_weight])             # Build a forest of trees from the training set (X, y).
get_params([deep])                  # Get parameters for this estimator.
predict(X)                       # Predict class for X.
predict_log_proba(X)                 # Predict class log-probabilities for X.
predict_proba(X)                    # Predict class probabilities for X.
score(X, y[, sample_weight])            # Returns the mean accuracy on the given test data and labels.
set_params(**params)                 # Set the parameters of this estimator.
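
A minimal sketch (again assuming the breast-cancer dataset for illustration): enable oob_score so the forest reports an out-of-bag accuracy estimate, and compare it with the held-out test accuracy.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative data: the breast-cancer dataset shipped with scikit-learn.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=100, oob_score=True,
                             n_jobs=-1, random_state=0)
clf.fit(X_train, y_train)

print(clf.oob_score_)             # out-of-bag accuracy estimate
print(clf.score(X_test, y_test))  # held-out test accuracy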