Classification with logistic regression
August 07, 2015
Let's take a look at a classification problem using a few common regression techniques. We will start with logistic regression.
In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets
from sklearn.cross_validation import train_test_split, cross_val_score
digits = datasets.load_digits()
# split dataset into train and test
X_train, X_test, y_train, y_test = train_test_split( digits.data,
                                                     digits.target,
                                                     test_size=0.33 )
Cs = np.logspace(-4., 4., 30)
scores = list()
scores_std = list()
# logistic regression
logreg = linear_model.LogisticRegression()
for C in Cs:
    logreg.C = C
    this_scores = cross_val_score( logreg, X_train, y_train )
    scores.append(np.mean(this_scores))
    scores_std.append(np.std(this_scores))
fig, ax1 = plt.subplots( figsize=(10, 6) )
ax1.semilogx( Cs, scores )
# plot error lines showing +/- std. errors of the scores
ax1.semilogx( Cs, np.array(scores) + np.array(scores_std)
                  / np.sqrt(len(X_train)), 'b--' )
ax1.semilogx( Cs, np.array(scores) - np.array(scores_std)
                  / np.sqrt(len(X_train)), 'b--' )
ax1.set_ylabel( 'CV score' )
ax1.set_xlabel( 'C' )
ax1.axhline( np.max(scores), linestyle='--', color='.5' )
# show scores_std on a separate axis
ax2 = ax1.twinx()
ax2.semilogx( Cs, scores_std, 'r-' )
ax2.set_ylabel( 'scores_std', color='r' )
plt.show()
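For reference, the C with the highest mean CV score can be read straight off the curve; a quick sketch (best_C is just an illustrative name, not used elsewhere in this post):
best_C = Cs[ np.argmax(scores) ]  # C with the highest mean CV score
print( 'best C: {0:.5f}, score: {1:.5f}'.format( best_C, np.max(scores) ) )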
The CV score and the standard deviation of the scores don't look too bad. Let's take a look at how C and the score change from fold to fold. The reported C is averaged over the classes that the logistic regression was fit on.
In [2]:
from sklearn.cross_validation import KFold
logreg_cv = linear_model.LogisticRegressionCV( Cs=Cs )
# evaluate the cross-validated model fold by fold
k_fold = KFold( len(X_train), 5 )
for k, (train, test) in enumerate( k_fold ):
    logreg_cv.fit( X_train[ train ], y_train[ train ] )
    print("[fold {0}] C: {1:.5f}, score: {2:.5f}".format( k,
          np.average( logreg_cv.C_ ),
          logreg_cv.score( X_train[ test ], y_train[ test ] )))
The scores are clustered around 0.95. The C parameter varies from fold to fold, but that's no major concern at this stage. Let's calibrate C over the entire training set.
In [3]:
logreg_cv.fit( X_train, y_train )
print( 'C:{0}, score:{1}'.format( np.average( logreg_cv.C_ ),
                                  logreg_cv.score( X_train, y_train ) ) )
We have a calibrated model, but how does it perform on the test set?
In [4]:
print( 'Test score:{0}'.format( logreg_cv.score( X_test, y_test ) ) )
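Accuracy alone doesn't tell us which digits get confused with one another. One way to dig deeper is scikit-learn's classification_report and confusion_matrix on the same test predictions; a quick sketch:
from sklearn.metrics import classification_report, confusion_matrix

y_pred = logreg_cv.predict( X_test )              # predictions from the calibrated model
print( classification_report( y_test, y_pred ) )  # per-class precision, recall, F1
print( confusion_matrix( y_test, y_pred ) )       # which digits get mixed up with which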