
Classification with logistic regression

August 07, 2015

Let's take a look at a classification problem using a few common regression techniques. We will start with logistic regression.
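For context, with the default L2 penalty scikit-learn's LogisticRegression minimizes a regularized log-loss; in the binary case the objective is (roughly)

    \min_{w, c} \; \tfrac{1}{2} w^T w + C \sum_{i=1}^{n} \log\left( 1 + \exp\left( -y_i (x_i^T w + c) \right) \right)

so C is the inverse regularization strength: a small C means strong regularization, a large C means a fit close to plain maximum likelihood. The digits dataset has ten classes, so one such problem is fit per digit (one-vs-rest).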

In [1]:
%matplotlib inline

import numpy as np

import matplotlib.pyplot as plt
from sklearn import linear_model, datasets
from sklearn.cross_validation import train_test_split, cross_val_score

digits = datasets.load_digits()

# split dataset into train and test
X_train, X_test, y_train, y_test = train_test_split( digits.data, 
                                                    digits.target,
                                                    test_size=0.33 )

# grid of candidate values for the inverse regularization strength C
Cs = np.logspace(-4., 4., 30)

scores = list()
scores_std = list()

# logistic regression
logreg = linear_model.LogisticRegression()

# score each candidate C with cross-validation on the training set
for C in Cs:
    logreg.C = C
    this_scores = cross_val_score( logreg, X_train, y_train )
    scores.append(np.mean(this_scores))
    scores_std.append(np.std(this_scores))

fig, ax1 = plt.subplots( figsize=(10, 6) )    
ax1.semilogx( Cs, scores )
# plot error lines showing +/- std. errors of the scores
ax1.semilogx( Cs, np.array(scores) + np.array(scores_std) / np.sqrt(len(X_train)), 'b--' )
ax1.semilogx( Cs, np.array(scores) - np.array(scores_std) / np.sqrt(len(X_train)), 'b--' )
ax1.set_ylabel( 'CV score' )
ax1.set_xlabel( 'C' )
ax1.axhline( np.max(scores), linestyle='--', color='.5' )

# show scores_std on a separate axis
ax2 = ax1.twinx()
ax2.semilogx( Cs, scores_std, 'r-' )
ax2.set_ylabel( 'scores_std', color='r' )

plt.show()

The CV score and the standard deviation of the scores don't look too bad. Let's take a look at how C and the score change from fold to fold. The C shown below is averaged over the classes that the logistic regression was fit on.

In [2]:
from sklearn.cross_validation import KFold

logreg_cv = linear_model.LogisticRegressionCV( Cs=Cs )
k_fold = KFold( len(X_train), 5 )

for k, (train, test) in enumerate( k_fold ):
    logreg_cv.fit( X_train[ train ], y_train[ train ] )
    print("[fold {0}] C: {1:.5f}, score: {2:.5f}".format( k, 
                        np.average( logreg_cv.C_ ),
                        logreg_cv.score( X_train[test], y_train[ test ] )))
[fold 0] C: 43.51255, score: 0.95851
[fold 1] C: 12.02224, score: 0.97095
[fold 2] C: 48.02748, score: 0.95851
[fold 3] C: 6.28055, score: 0.97083
[fold 4] C: 0.08362, score: 0.93750
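To see why C is averaged: LogisticRegressionCV selects one C per class in the one-vs-rest fit, so its C_ attribute holds one entry per digit. A quick check (a sketch, assuming logreg_cv has been fit as above):

print( logreg_cv.C_.shape )   # (10,) - one selected C per digit class
print( logreg_cv.C_ )         # the per-class values that np.average collapses to a single number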

The scores are clustered around 0.95. The C parameter varies between folds, but that is no major concern at this stage. Let's calibrate C over the entire training set.

In [3]:
logreg_cv.fit( X_train, y_train )

print( 'C:{0}, score:{1}'.format( np.average( logreg_cv.C_ ),
                            logreg_cv.score( X_train, y_train ) ) )
C:41.8084380786, score:0.989193682461

We have a calibrated model, but how does it perform on the test set?

In [4]:
print( 'Test score:{0}'.format( logreg_cv.score( X_test, y_test ) ) )
Test score:0.973063973064
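If we wanted to see which digits the remaining errors fall on, a confusion matrix would be a natural next step (a sketch, not run as part of this notebook):

from sklearn.metrics import classification_report, confusion_matrix

y_pred = logreg_cv.predict( X_test )
print( confusion_matrix( y_test, y_pred ) )       # rows: true digit, columns: predicted digit
print( classification_report( y_test, y_pred ) )  # per-class precision / recall / f1-score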