2.2 分类函数

2.2.1 GuassianNB(高斯朴素贝叶斯)

高斯朴素贝叶斯方法是基于贝叶斯定理的一组有监督学习算法,即“简单”地假设每对特征之间相互独立。类别的判定是通过后验概率最大化,先验概率可以人为设定,也可以根据样本数据推算,假定类条件概率服从高斯分布,即类别的特征向量在特征空间中服从正太分布,通过训练样本学习到每个类的高斯分布的参数(均值和协方差矩阵),最终便学习到朴素贝叶斯分类器。

class sklearn.naive_bayes.GaussianNB(priors=None)

参数解释(Parameters):

priors : array-like, shape (n_classes,)      #类别的先验概率。 如果不指定,则根据样本数据推算先验。.

属性(Attributes):

class_prior_ : array, shape (n_classes,)    #每个类的先验概率
class_count_ : array, shape (n_classes,)    #每个类别中观测到的训练样本数
theta_ : array, shape (n_classes, n_features)   #每类特征的均值
sigma_ : array, shape (n_classes, n_features)   #每类特征的方差

方法(Methods):

fit(X, y[, sample_weight])  #通过训练样本 X, y来学习 Gaussian Naive Bayes 分类器.{X的形式为(n_samples,n_features)}
get_params([deep])  #获取分类器的参数.
predict(X)  #对测试数据进行分类预测。{X的形式为(n_samples,n_features)}
predict_log_proba(X)    #对于测试样本特征向量X,得到取对数的概率估计值
predict_proba(X)    #对于测试样本特征向量X,得到概率估计
score(X, y[, sample_weight])    #返回给定测试样本和真实Label的平均精度(X为测试样本,y为真实Label)
set_params(**params)    #设置分类器的参数

2.2.2 NearestNeighbors(最近邻)

最近邻分类属于基于实例的学习或非泛化学习 :它不会去构造一个泛化的内部模型,而是简单地存储训练数据的实例。 分类是由每个点的最近邻的简单多数投票中计算得到的,一个查询点的数据类型是由它最近邻点中最具代表性的数据类型来决定的。

class sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, weights=’uniform’,algorithm=’auto’, leaf_size=30, p=2, metric=’minkowski’, metric_params=None, n_jobs=1, **kwargs)

参数解释(Parameters):

n_neighbors : int, optional (default = 5)   # 默认使用kneighbors搜索时,neighbors(临近)的数量

weights : str or callable, optional (default = ‘uniform’)     # 在预测时使用的权重函数. Possible values:
‘uniform’ : uniform weights. All points in each neighborhood are weighted equally.
‘distance’ : weight points by the inverse of their distance. in this case, closer neighbors of a query point will have a
           greater influence than neighbors which are further away.
 [callable] : a user-defined function which accepts an array of distances, and returns an array of the same shape containing
           the weights.

algorithm : {‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, optional   #用于计算最近邻的算法::
‘ball_tree’ will use BallTree
‘kd_tree’ will use KDTree
‘brute’ will use a brute-force search.
‘auto’ will attempt to decide the most appropriate algorithm based on the values passed to fit method.
 注意: fitting on sparse input will override the setting of this parameter, using brute force.

leaf_size : int, optional (default = 30)   #Leaf size passed to BallTree or KDTree. This can affect the speed of the construction
and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.

p : integer, optional (default = 2)   # Minkowski距离度量的幂指数参数. When p = 1, this is equivalent to using
manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.

metric : string or callable, default ‘minkowski’     # the distance metric to use for the tree.
The default metric is minkowski, and with p=2 is equivalent to the standard Euclidean metric.
See the documentation of the DistanceMetric class for a list of available metrics.

metric_params : dict, optional (default = None)    # Additional keyword arguments for the metric function.

n_jobs : int, optional (default = 1)     # The number of parallel jobs to run for neighbors search.
If -1, then the number of jobs is set to the number of CPU cores. Doesn’t affect fit method.

方法(Methods):

fit(X, y)                            # 通过训练样本 X, y来学习 KNeighborsClassifier分类器
get_params([deep])                      # 获取estimator的参数
kneighbors([X, n_neighbors, return_distance])    # 找到一个样本点的 K-neighbors
kneighbors_graph([X, n_neighbors, mode])        # Computes the (weighted) graph of k-Neighbors for points in X
predict(X)                           # 对X进行预测类别
predict_proba(X)                        # 对于样本特征向量X,得到概率估计.
score(X, y[, sample_weight])                # 返回给定测试样本和真实Label的平均精度(X为测试样本,y为真实Label)
set_params(**params)                     # 设置分类器的参数

2.2.3 cross_val_score(交叉验证)

计算交叉验证的精度: 获得一个数组,该数组中每个元素是交叉验证时每次运行的一个精度值。

sklearn.cross_validation.cross_val_score(estimator, X, y=None, scoring=None, cv=None, n_jobs=1, verbose=0, fit_params=None, pre_dispatch=‘2*n_jobs’)

参数解释(Parameters):

estimator : estimator object implementing ‘fit’            # The object to use to fit the data.
X : array-like         # The data to fit. Can be for example a list, or an array.
y : array-like, optional, default: None      # The target variable to try to predict in the case of supervised learning.
groups : array-like, with shape (n_samples,), optional .      # Group labels for the samples used while splitting the
dataset into train/test set.
scoring : string, callable or None, optional, default: None.     # A string (see model evaluation documentation) or a scorer
callable object / function with signature scorer(estimator, X, y).

cv : int, cross-validation generator or an iterable, optional.      #   Determines the cross-validation splitting strategy.
Possible inputs for cv are:
    None, to use the default 3-fold cross validation,
    integer, to specify the number of folds in a (Stratified)KFold,
    An object to be used as a cross-validation generator.
    An iterable yielding train, test splits.
For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used.
In all other cases, KFold is used.

n_jobs : integer, optional         # The number of CPUs to use to do the computation. -1 means ‘all CPUs’.
verbose : integer, optional         # The verbosity level.
fit_params : dict, optional         # Parameters to pass to the fit method of the estimator.

pre_dispatch : int, or string, optional    # Controls the number of jobs that get dispatched during parallel execution.
          Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched
          than CPUs can process.
This parameter can be:
    None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs,  to avoid delays due to on-demand spawning of the jobs.
    An int, giving the exact number of total jobs that are spawned.
    A string, giving an expression as a function of n_jobs, as in ‘2*n_jobs’.

2.2.4 GridSearchCV(网格搜素)

实现自动调参,只要把参数输进去,就能给出最优化的结果和参数。用于系统地遍历多种参数组合,通过交叉验证确定最佳效果参数。 但是这个方法适合于小数据集,一旦数据的量级上去了,运行时间很长,较难得出结果。

class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, fit_params=None, n_jobs=1, iid=True, refit=True, cv=None, verbose=0, pre_dispatch=‘2*n_jobs’,error_score=’raise’, return_train_score=’warn’)

参数解释(Parameters):

estimator:estimator object.   #所使用的分类器,estimator=RandomForestClassifier(min_samples_split=100,min_samples_leaf=20,
        max_depth=8,max_features='sqrt',random_state=10),
      并且传入除需要确定最佳的参数之外的其他参数。每一个分类器都需要一个scoring参数,或者score方法。

param_grid:值为字典或者列表,即需要最优化的参数的取值,param_grid =param_test1,param_test1 = {'n_estimators':range(10,71,10)}。

scoring :准确度评价标准,默认None,这时需要使用score函数;或者如scoring='roc_auc',根据所选模型不同,评价准则不同。
     字符串(函数名),或是可调用对象,需要其函数签名形如:scorer(estimator, X, y);如果是None,则使用estimator的误差估计函数。

it_params : dict, optional          # Parameters to pass to the fit method.
n_jobs : int, default=1            # Number of jobs to run in parallel.

pre_dispatch : int, or string, optional   # Controls the number of jobs that get dispatched during parallel execution.
    Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than
    CPUs can process. This parameter can be:
    None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs,
to avoid delays due to on-demand spawning of the jobs.
    An int, giving the exact number of total jobs that are spawned.
    A string, giving an expression as a function of n_jobs, as in ‘2*n_jobs’.

iid : boolean, default=True       #If True, the data is assumed to be identically distributed across the folds,
and the loss minimized is the total loss per sample, and not the mean loss across the folds.

cv :交叉验证参数,默认None,使用三折交叉验证。指定fold数量,默认为3,也可以是yield训练/测试数据的生成器。

refit : boolean, or string, default=True     # Refit an estimator using the best found parameters on the whole dataset.

verbose : integer.                    # Controls the verbosity: the higher, the more messages.

error_score : ‘raise’ (default) or numeric.  # Value to assign to the score if an error occurs in estimator fitting.
If set to ‘raise’, the error is raised. If a numeric value is given, FitFailedWarning is raised. This parameter does
not affect the refit step, which will always raise the error.

return_train_score : boolean, optional.      # If False, the cv_results_ attribute will not include training scores.

属性(Attributes):

cv_results_ : dict of numpy (masked) ndarrays   # A dict with keys as column headers and values as columns,
that can be imported into a pandas DataFrame.

best_estimator_ : estimator or dict.        # Estimator that was chosen by the search, i.e. estimator which gave highest score
(or smallest loss if specified) on the left out data. Not available if refit=False.

best_score_ : float.     # Mean cross-validated score of the best_estimator.

best_params_ : dict.     # Parameter setting that gave the best results on the hold out data.

best_index_ : int.       # The index (of the cv_results_ arrays) which corresponds to the best candidate parameter setting.

scorer_ : function or a dict.  # Scorer function used on the held out data to choose the best parameters for the model.

n_splits_ : int.        # The number of cross-validation splits (folds/iterations).

方法(Methods):

decision_function(X)              # Call decision_function on the estimator with the best found parameters.
fit(X[, y, groups])              # Run fit with all sets of parameters.
get_params([deep])              # Get parameters for this estimator.
inverse_transform(Xt)            # Call inverse_transform on the estimator with the best found params.
predict(X)                   # Call predict on the estimator with the best found parameters.
predict_log_proba(X)             # Call predict_log_proba on the estimator with the best found parameters.
predict_proba(X)                # Call predict_proba on the estimator with the best found parameters.
score(X[, y])                  # Returns the score on the given data, if the estimator has been refit.
set_params(**params)              #Set the parameters of this estimator.
transform(X)                   # Call transform on the estimator with the best found parameters.

2.2.5 confusion_matrix(混淆矩阵)

confusion_matrix 函数通过计算混淆矩阵来评估分类的准确性。

sklearn.metrics.confusion_matrix(y_true, y_pred, labels=None, sample_weight=None)

参数解释(Parameters):

y_true : array, shape = [n_samples]             # Ground truth (correct) target values.

y_pred : array, shape = [n_samples]             # Estimated targets as returned by a classifier.

labels : array, shape = [n_classes], optional.      # List of labels to index the matrix. This may be used to reorder or
select a subset of labels. If none is given, those that appear at least once in y_true or y_pred are used in sorted order.

sample_weight : array-like of shape = [n_samples], optional.     # Sample weights

Returns:

C : array, shape = [n_classes, n_classes]   # Confusion matrix