1.4 Example Data

1.4.1 Synthetic Datasets

The datasets module of sklearn provides many commonly used synthetic datasets, whose generated distributions can be controlled:

[1]:
from sklearn import datasets

The following three are used frequently:

[2]:
n_samples = 1500

noisy_circles = datasets.make_circles(n_samples=n_samples, factor=.5,
                                      noise=.05)

noisy_moons = datasets.make_moons(n_samples=n_samples, noise=.05)

blobs = datasets.make_blobs(centers=2, n_samples=n_samples, random_state=8)
[3]:
toy_datasets = [noisy_circles, noisy_moons, blobs]

Three parameters are used most often:

* n_samples controls the number of samples generated;
* noise controls the amount of noise added on top;
* random_state makes the generated data identical from run to run (demonstrated in the sketch below).
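A minimal sketch of the random_state behavior, assuming the datasets import from above (the sample size of 10 is only illustrative):

import numpy as np

# Two calls with the same random_state return identical samples.
X1, y1 = datasets.make_moons(n_samples=10, noise=.05, random_state=0)
X2, y2 = datasets.make_moons(n_samples=10, noise=.05, random_state=0)
print(np.allclose(X1, X2))   # True

# Omitting random_state draws fresh data on every call.
X3, _ = datasets.make_moons(n_samples=10, noise=.05)
print(np.allclose(X1, X3))   # almost surely False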

Detailed explanations of the corresponding parameters can be found at:

http://scikit-learn.org/stable/datasets/index.html#datasets

Visualizing these datasets looks as follows:

[6]:
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from matplotlib.colors import ListedColormap
[7]:
cm = plt.cm.RdBu
cm_bright = ListedColormap(['Red', 'Blue'])

plt.figure(figsize=(len(toy_datasets) * 2 + 3, 9.5))

for plot_num, data in enumerate(toy_datasets):

    X, y = data

    # standardize the features of every dataset except blobs
    if data is not blobs:
        X = StandardScaler().fit_transform(X)

    ax = plt.subplot(1, len(toy_datasets), plot_num + 1)

    ax.scatter(X[:, 0], X[:, 1], c=y, cmap=cm_bright)
    ax.set_aspect('equal', 'datalim')

    # hide the axis ticks
    plt.xticks(())
    plt.yticks(())
(Figure: scatter plots of the three toy datasets: noisy_circles, noisy_moons, and blobs)

1.4.2 Data Preprocessing

(1) Environment Setup

Display figures inline, rendered at high ('retina') resolution:

[1]:
%pylab inline
%config InlineBackend.figure_format = 'retina'
Populating the interactive namespace from numpy and matplotlib

Import the tools we need:

[2]:
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt

mpl.style.use('ggplot')

If seaborn is not installed, it can be installed with:

conda install seaborn
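If conda is not available, pip works as well:

pip install seaborn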

(2) Loading the Data

Load the Iris dataset from the seaborn module:

[3]:
iris = sns.load_dataset('iris')
iris.head()
[3]:
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

sklearn also ships with the Iris dataset, so it can be loaded from there instead:

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target
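In the sklearn version the labels come integer-encoded, and the species names live in target_names; these are standard load_iris attributes. (Note they belong to the sklearn Bunch object, not to the seaborn DataFrame used in the sections below.)

print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']
print(X.shape)             # (150, 4)
print(y[:5])               # [0 0 0 0 0] -- the first samples are setosa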

1.4.3 Data Cleaning

(1) Checking for Missing Values

Check whether the data contain any missing values:

[4]:
iris.isnull().sum()
[4]:
sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64
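Iris is complete, so no cleaning is needed here. Had there been missing values, pandas offers dropna and fillna; a hypothetical sketch (not needed for this dataset):

# Drop any row that contains a missing value:
iris_clean = iris.dropna()

# Or impute a numeric column with its mean:
iris['sepal_length'] = iris['sepal_length'].fillna(
    iris['sepal_length'].mean())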

(2) Summary Statistics

Take an overview of the feature data and inspect its summary statistics:

[5]:
iris.describe()
[5]:
       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000

Take an overview of the class labels to see how many classes there are and how many samples each contains:

[6]:
iris["species"].value_counts()
[6]:
setosa        50
virginica     50
versicolor    50
Name: species, dtype: int64

The classes are evenly distributed.

(3) Initial Visualization

Visualize all pairwise feature combinations in feature space, colored by class:

[7]:
sns.pairplot(iris, hue="species", size=3)  # in seaborn >= 0.9 the argument is height=3
[7]:
<seaborn.axisgrid.PairGrid at 0x112130eb8>
(Figure: pairplot of the four Iris features, colored by species)

The histograms on the diagonal show the distribution of each feature on its own. The pairwise plots show that the Iris-setosa class (red) can be separated from the other two by any pair of features.
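This separability is easy to verify directly. For example, a single threshold on petal_length already isolates setosa; the threshold of 2.5 is chosen by eye from the plot, and the sketch assumes the seaborn DataFrame loaded above:

is_setosa = iris['species'] == 'setosa'
print(iris.loc[is_setosa, 'petal_length'].max())    # 1.9
print(iris.loc[~is_setosa, 'petal_length'].min())   # 3.0

# Every setosa sample falls below the threshold, every other sample above it.
print(((iris['petal_length'] < 2.5) == is_setosa).all())   # True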

(4) Data Conversion

Read the inputs and the class labels separately:

[8]:
from sklearn.datasets import load_iris

iris = load_iris()
X_iris = iris.data
y_iris = iris.target

1.4.4 Fitting a Model

Use a very small portion of the data to verify that the data are correct and the model runs properly:

[9]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from ipywidgets import interact, interact_manual
[10]:
@interact_manual
def sanity_check(train_size: (0, 1, 0.05)=0.04):
    X_train, X_test, y_train, y_test = train_test_split(
        X_iris,
        y_iris,
        train_size=train_size,
        random_state=0,
        stratify=y_iris)
    model = LogisticRegression()
    model.fit(X_train, y_train)
    accuracy = model.score(X_train, y_train)
    print(y_train)
    print('Training on a fraction {0:.2} of the data '
          'gives accuracy {1:.2f}'.format(train_size, accuracy))
[2 2 0 0 1 1 2]
Training on a fraction 0.05 of the data gives accuracy 0.71
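Even with only seven samples, the printed y_train contains all three classes; that is the effect of stratify. A quick standalone check (a sketch that repeats the split outside the widget):

import numpy as np

X_tr, _, y_tr, _ = train_test_split(
    X_iris, y_iris, train_size=0.05, random_state=0, stratify=y_iris)
print(np.bincount(y_tr))   # near-equal counts for the three classes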

(1) Accuracy with Default Parameters

Use roughly 50% of the data for training, 25% for validation, and 25% for testing:

[11]:
X_train, X_test, y_train, y_test = train_test_split(
        X_iris,
        y_iris,
        train_size=0.75,
random_state=42,
        stratify=y_iris)
[12]:
X_train, X_valid, y_train, y_valid = train_test_split(
        X_train,
        y_train,
        train_size=0.66,
random_state=0,
        stratify=y_train)

Fit a classifier with the default parameters:

[13]:
model = LogisticRegression()
model.fit(X_train, y_train)
[13]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
[14]:
model.score(X_train, y_train)
[14]:
0.9726027397260274
[15]:
model.score(X_valid, y_valid)
[15]:
0.94871794871794868

(2) Hyperparameter Search

The objective of L2-penalized logistic regression is (C is the inverse of the regularization strength):

\[\min_{w, c} \frac{1}{2}w^T w + C \sum_{i=1}^n \log(\exp(-y_i (X_i^T w + c)) + 1)\]

Use the holdout method for the hyperparameter search:

[16]:
X_train, X_test, y_train, y_test = train_test_split(
        X_iris,
        y_iris,
        train_size=0.75,
random_state=42,
        stratify=y_iris)

Alternatively, sklearn.linear_model.LogisticRegressionCV performs this search directly (a minimal sketch follows), while validation_curve, used in the next cell, returns accuracy values over a whole range of parameter settings:
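A sketch of the LogisticRegressionCV route, assuming the X_train and y_train split above (Cs=10 asks for 10 candidate values of C on a log scale):

from sklearn.linear_model import LogisticRegressionCV

# Cross-validates over the candidate C values, then refits with the best one.
cv_model = LogisticRegressionCV(Cs=10, cv=5)
cv_model.fit(X_train, y_train)
print(cv_model.C_)   # best C per class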

[17]:
from sklearn.model_selection import validation_curve

model = LogisticRegression()
# 60 candidate values of C, log-spaced between 1e-2 and 1e5
param_range = np.logspace(-2, 5, 60)

# 5-fold cross-validated accuracy for every candidate C
train_scores, test_scores = validation_curve(
    estimator=model,
    X=X_train,
    y=y_train,
    param_name='C',
    param_range=param_range,
    cv=5)

# mean and standard deviation across folds, per C value
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

plt.plot(
    param_range,
    train_mean,
    color='blue',
    marker='o',
    markersize=5,
    label='training accuracy')

plt.fill_between(
    param_range,
    train_mean + train_std,
    train_mean - train_std,
    alpha=0.15,
    color='blue')

plt.plot(
    param_range,
    test_mean,
    color='green',
    linestyle='--',
    marker='s',
    markersize=5,
    label='validation accuracy')

plt.fill_between(
    param_range,
    test_mean + test_std,
    test_mean - test_std,
    alpha=0.15,
    color='green')

plt.grid()
plt.xscale('log')
plt.legend(loc='lower right')
plt.xlabel('Parameter C')
plt.ylabel('Accuracy')
plt.ylim([0.8, 1.0])
plt.tight_layout()

(Figure: training and validation accuracy versus the parameter C, with one-standard-deviation bands)

Based on the shape of the curve, the search can be refined over a smaller range.

(3) Grid Search over a Smaller Range

[18]:
from sklearn.model_selection import GridSearchCV

param_grid = {'C': np.arange(0.1, 10, 0.1),
              'penalty': ['l2', 'l1']}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
[19]:
grid.fit(X_train,y_train);
[20]:
grid.best_params_
[20]:
{'C': 0.90000000000000002, 'penalty': 'l2'}
[21]:
grid.best_score_
[21]:
0.9642857142857143
[22]:
model = grid.best_estimator_  # already refit on the full training set (refit=True by default)

1.4.5 Evaluating Accuracy

(1) Accuracy

[23]:
y_model = model.predict(X_test)
accuracy_score(y_test, y_model)
[23]:
0.89473684210526316
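A single accuracy number hides per-class behavior; sklearn's classification_report gives precision, recall, and F1 per species. A sketch using the y_test and y_model from above (iris here is the sklearn Bunch, so target_names is available):

from sklearn.metrics import classification_report

print(classification_report(y_test, y_model,
                            target_names=iris.target_names))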

(2) Confusion Matrix

[24]:
from sklearn.metrics import confusion_matrix

mat = confusion_matrix(y_test, y_model)

sns.heatmap(mat, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=iris.target_names,
            yticklabels=iris.target_names)
plt.xlabel('predicted value')
plt.ylabel('true value');
(Figure: confusion matrix heatmap, true versus predicted species)