1.4 Example Data
1.4.1 Synthetic Datasets
The datasets module in sklearn provides many commonly used synthetic dataset generators, which let you control the distribution of the generated data:
[1]:
from sklearn import datasets
Three commonly used generators are shown below:
[2]:
n_samples = 1500
noisy_circles = datasets.make_circles(n_samples=n_samples, factor=.5,
                                      noise=.05)
noisy_moons = datasets.make_moons(n_samples=n_samples, noise=.05)
blobs = datasets.make_blobs(centers=2, n_samples=n_samples, random_state=8)
[3]:
toy_datasets = [noisy_circles, noisy_moons, blobs]
Three parameters are used most often:
* n_samples controls the number of generated samples;
* noise controls the magnitude of the added noise;
* random_state makes the generated data identical on every run.
A detailed explanation of these parameters can be found at:
http://scikit-learn.org/stable/datasets/index.html#datasets
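As a quick, minimal sketch of how these parameters behave (the variable names below are only for illustration), fixing random_state makes the generated samples reproducible, while a larger noise spreads the points further from the underlying shape:

import numpy as np
from sklearn import datasets

# Same random_state -> identical samples on every call
X1, y1 = datasets.make_moons(n_samples=100, noise=.05, random_state=0)
X2, y2 = datasets.make_moons(n_samples=100, noise=.05, random_state=0)
print(np.allclose(X1, X2))          # True: the two draws are identical

# Larger noise -> points scatter further from the two half-moons
X_noisy, _ = datasets.make_moons(n_samples=100, noise=.3, random_state=0)
print(X1.shape, X_noisy.shape)      # (100, 2) (100, 2)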
These datasets can be visualized as follows:
[6]:
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from matplotlib.colors import ListedColormap
[7]:
cm = plt.cm.RdBu
cm_bright = ListedColormap(['Red', 'Blue'])
plt.figure(figsize=(len(toy_datasets) * 2 + 3, 9.5))
for plot_num, data in enumerate(toy_datasets):
    X, y = data
    # standardize every dataset except the blobs
    if data is not blobs:
        X = StandardScaler().fit_transform(X)
    ax = plt.subplot(1, len(toy_datasets), plot_num + 1)
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap=cm_bright)
    ax.set_aspect('equal', 'datalim')
    plt.xticks(())
    plt.yticks(())
1.4.2 Data Preprocessing
(1) Environment Setup
Display figures inline, using the high-resolution 'retina' format:
[1]:
%pylab inline
%config InlineBackend.figure_format = 'retina'
Populating the interactive namespace from numpy and matplotlib
Import the required tools:
[2]:
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.style.use('ggplot')
If seaborn is not installed, install it with:
conda install seaborn
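If conda is not available, seaborn can also be installed with pip:
pip install seaborn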
(2) Reading the Data
Load the Iris dataset from the seaborn module:
[3]:
iris = sns.load_dataset('iris')
iris.head()
[3]:
 | sepal_length | sepal_width | petal_length | petal_width | species
---|---|---|---|---|---
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa
sklearn also ships with the Iris dataset, which can be used instead:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
1.4.3 Data Cleaning
(1) Checking for Missing Values
Check whether the dataset contains any missing values:
[4]:
iris.isnull().sum()
[4]:
sepal_length 0
sepal_width 0
petal_length 0
petal_width 0
species 0
dtype: int64
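There are no missing values in this dataset. If there were, a common remedy is to drop the affected rows or fill them with a column statistic; a minimal sketch (the names iris_dropped and iris_filled are only for illustration):

# Drop any rows that contain missing values
iris_dropped = iris.dropna()

# Or fill missing numeric values with the mean of each column
iris_filled = iris.fillna(iris.mean(numeric_only=True))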
(2) Summary Statistics
Get an overview of the feature data by looking at its summary statistics:
[5]:
iris.describe()
[5]:
 | sepal_length | sepal_width | petal_length | petal_width
---|---|---|---|---
count | 150.000000 | 150.000000 | 150.000000 | 150.000000
mean | 5.843333 | 3.057333 | 3.758000 | 1.199333
std | 0.828066 | 0.435866 | 1.765298 | 0.762238
min | 4.300000 | 2.000000 | 1.000000 | 0.100000
25% | 5.100000 | 2.800000 | 1.600000 | 0.300000
50% | 5.800000 | 3.000000 | 4.350000 | 1.300000
75% | 6.400000 | 3.300000 | 5.100000 | 1.800000
max | 7.900000 | 4.400000 | 6.900000 | 2.500000
Get an overview of the class labels: how many classes there are and how many samples each class contains:
[6]:
iris["species"].value_counts()
[6]:
setosa 50
virginica 50
versicolor 50
Name: species, dtype: int64
The classes are evenly distributed.
(3) Initial Data Visualization
Visualize each pairwise combination of features in feature space, colored by class:
[7]:
sns.pairplot(iris, hue="species", size=3)
[7]:
<seaborn.axisgrid.PairGrid at 0x112130eb8>
The histograms on the diagonal show the frequency distribution of each individual feature. From the pairwise plots, the Iris-setosa species (in red) can be separated from the other two by any combination of two features, as the sketch below illustrates.
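To back up this observation, a minimal sketch (reusing the pandas DataFrame loaded above) that plots only the two petal features already shows setosa sitting in its own cluster:

import matplotlib.pyplot as plt

# iris is still the pandas DataFrame loaded with seaborn above
fig, ax = plt.subplots(figsize=(5, 4))
for species, group in iris.groupby('species'):
    ax.scatter(group['petal_length'], group['petal_width'], label=species)
ax.set_xlabel('petal_length')
ax.set_ylabel('petal_width')
ax.legend();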
(4) Data Conversion
Read the data into a feature matrix and a label vector:
[8]:
from sklearn.datasets import load_iris
iris = load_iris()
X_iris = iris.data
y_iris = iris.target
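Equivalently, the feature matrix and labels can be pulled straight out of the pandas DataFrame. A minimal sketch, assuming the seaborn DataFrame is reloaded under a separate name (iris_df and the other names below are hypothetical, chosen to avoid clashing with the sklearn Bunch just loaded):

import seaborn as sns

iris_df = sns.load_dataset('iris')   # the same DataFrame as in section 1.4.2
feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
X_from_df = iris_df[feature_cols].values   # feature matrix, shape (150, 4)
y_from_df = iris_df['species'].values      # class labels as strings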
1.4.4 Calling the Model
Use a very small portion of the data to verify that the data is correct and the model runs properly:
[9]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from ipywidgets import interact,interact_manual
[10]:
from ipywidgets import interact_manual
@interact_manual
def sanity_check(train_size: (0, 1, 0.05)=0.04):
    X_train, X_test, y_train, y_test = train_test_split(
        X_iris,
        y_iris,
        train_size=train_size,
        random_state=0,
        stratify=y_iris)
    model = LogisticRegression()
    model.fit(X_train, y_train)
    accuracy = model.score(X_train, y_train)
    print(y_train)
    print('Using {0:.2} of the data for training, '
          'the accuracy is {1:.2f}'.format(train_size, accuracy))
[2 2 0 0 1 1 2]
Using 0.05 of the data for training, the accuracy is 0.71
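If ipywidgets is not available, the same sanity check can be run as an ordinary function call; a minimal sketch that reuses the imports and X_iris / y_iris defined above (the function name is only for illustration):

def sanity_check_plain(train_size=0.05):
    X_train, X_test, y_train, y_test = train_test_split(
        X_iris, y_iris, train_size=train_size,
        random_state=0, stratify=y_iris)
    model = LogisticRegression()
    model.fit(X_train, y_train)
    return model.score(X_train, y_train)

sanity_check_plain(0.05)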
(1) Accuracy with Default Parameters
Use 50% of the data for training, 25% for validation, and 25% for testing (the first split below keeps 75% of the data, and the second keeps 66% of that, i.e. roughly 50% of the total):
[11]:
X_train, X_test, y_train, y_test = train_test_split(
    X_iris,
    y_iris,
    train_size=0.75,
    random_state=42,
    stratify=y_iris)
[12]:
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train,
    y_train,
    train_size=0.66,
    random_state=0,
    stratify=y_train)
Fit a classifier with the default parameters:
[13]:
model = LogisticRegression()
model.fit(X_train, y_train)
[13]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
[14]:
model.score(X_train, y_train)
[14]:
0.9726027397260274
[15]:
model.score(X_valid, y_valid)
[15]:
0.94871794871794868
(2) Hyperparameter Search
Search for the regularization strength of the L2-penalized logistic regression using the holdout method:
[16]:
X_train, X_test, y_train, y_test = train_test_split(
    X_iris,
    y_iris,
    train_size=0.75,
    random_state=42,
    stratify=y_iris)
You can also use sklearn.linear_model.LogisticRegressionCV (see the sketch at the end of this subsection), or use validation_curve to obtain a series of accuracy scores:
[17]:
from sklearn.model_selection import validation_curve
import numpy as np

model = LogisticRegression()
param_range = np.logspace(-2, 5, 60)
train_scores, test_scores = validation_curve(
    estimator=model,
    X=X_train,
    y=y_train,
    param_name='C',
    param_range=param_range,
    cv=5)
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)
plt.plot(
    param_range,
    train_mean,
    color='blue',
    marker='o',
    markersize=5,
    label='training accuracy')
plt.fill_between(
    param_range,
    train_mean + train_std,
    train_mean - train_std,
    alpha=0.15,
    color='blue')
plt.plot(
    param_range,
    test_mean,
    color='green',
    linestyle='--',
    marker='s',
    markersize=5,
    label='validation accuracy')
plt.fill_between(
    param_range,
    test_mean + test_std,
    test_mean - test_std,
    alpha=0.15,
    color='green')
plt.grid()
plt.xscale('log')
plt.legend(loc='lower right')
plt.xlabel('Parameter C')
plt.ylabel('Accuracy')
plt.ylim([0.8, 1.0])
plt.tight_layout()
Based on the curve, the search can then be refined over a smaller range of values.
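As mentioned above, sklearn.linear_model.LogisticRegressionCV performs the cross-validated search over C internally. A minimal sketch, assuming the same logarithmic grid of C values as the validation curve above:

from sklearn.linear_model import LogisticRegressionCV
import numpy as np

# 5-fold cross-validated search over C on the training data
model_cv = LogisticRegressionCV(Cs=np.logspace(-2, 5, 60), cv=5)
model_cv.fit(X_train, y_train)
model_cv.C_   # the C value(s) selected by cross-validation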
(3) Grid Search over a Smaller Range
[18]:
from sklearn.model_selection import GridSearchCV
param_grid = {'C': np.arange(0.1, 10, 0.1),
              'penalty': ['l2', 'l1']}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
[19]:
grid.fit(X_train,y_train);
[20]:
grid.best_params_
[20]:
{'C': 0.90000000000000002, 'penalty': 'l2'}
[21]:
grid.best_score_
[21]:
0.9642857142857143
[22]:
model = grid.best_estimator_
1.4.5 Accuracy Evaluation
(1) Accuracy
[23]:
y_model = model.predict(X_test)
accuracy_score(y_test, y_model)
[23]:
0.89473684210526316
(2) Confusion Matrix
[24]:
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(y_test, y_model)
sns.heatmap(mat, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.xlabel('predicted value')
plt.ylabel('true value');
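Besides overall accuracy and the confusion matrix, per-class precision, recall, and F1 scores can be summarized with classification_report; a short sketch (not part of the original evaluation above):

from sklearn.metrics import classification_report

print(classification_report(y_test, y_model, target_names=iris.target_names))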