Bag Of Words
September 01, 2015

In today's post we will take a look at an NLP classification task. One of the simplest algorithms is Bag-of-Words: each word is one-hot encoded, then the word vectors of a document are averaged and fed into a classifier.
As a dataset we are going to use movie reviews, which can be downloaded from Kaggle. A word of disclaimer: the code below is partially based on the sklearn tutorial, as well as on the very good Stanford NLP course CS224d.
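Before diving into the dataset, here is a minimal sketch of the averaging idea described above, on a toy corpus (the sentences and helper names are mine, purely for illustration):

```python
import numpy as np

# toy corpus and vocabulary
docs = ['good movie', 'bad movie', 'not a good movie']
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

def bow_average(doc):
    """One-hot encode each word, then average the one-hot vectors."""
    words = doc.split()
    vectors = np.zeros((len(words), len(vocab)))
    for row, word in enumerate(words):
        vectors[row, index[word]] = 1.0
    return vectors.mean(axis=0)

# each non-zero entry is the relative frequency of that word in the document
print(bow_average('good movie'))
```

Note that the result no longer depends on word order, which will come back to bite us at the end of the post.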
%matplotlib inline
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from bs4 import BeautifulSoup
from matplotlib import pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import roc_curve, auc
# load data
df = pd.read_csv( 'labeledTrainData.tsv', sep='\t' )
# convert the data
# strip HTML markup from the raw reviews
df.insert( 3, 'converted', df.iloc[ :, 2 ].apply( lambda x: BeautifulSoup( x, 'html.parser' ).get_text() ) )
print( 'available columns: {0}'.format( df.columns ) )
# train / test split, with roughly 66% of the rows going to train
tt_index = np.random.binomial( 1, 0.66, size=df.shape[0] )
train = df[ tt_index == 1 ]
test = df[ tt_index == 0 ]
vectorizer = TfidfVectorizer( encoding='latin1' )
vectorizer.fit_transform( train.iloc[ :, 3 ] )
# prepare data
X_train = vectorizer.transform( train.iloc[ :, 3 ] )
y_train = train.iloc[ :, 1 ]
X_test = vectorizer.transform( test.iloc[ :, 3 ] )
y_test = test.iloc[ :, 1 ]
# let's take a look at how the input classes are distributed.
# Roughly equal class frequencies make the classifier easier to train
train.hist( column='sentiment' )
plt.show()
# keep the 100 features most correlated with the labels (chi-squared test)
ch2 = SelectKBest( chi2, k=100 )
X_train = ch2.fit_transform( X_train, y_train ).toarray()
X_test = ch2.transform( X_test ).toarray()
# we're going to use a Gradient Boosted Trees classifier; these methods have shown good performance in a few Kaggle competitions
clf = GradientBoostingClassifier( n_estimators=100, learning_rate=1.0, max_depth=5, random_state=0 )
clf.fit(X_train, y_train)
y_score = clf.decision_function( X_test )
fpr, tpr, thresholds = roc_curve( y_test.ravel(), y_score.ravel() )
roc_auc = auc( fpr, tpr )
# Plot the ROC curve
plt.clf()
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc )
plt.legend(loc="lower right")
plt.show()
The area under the Receiver Operating Characteristic curve is 0.87, a decent margin over random binary classification (AUC of 0.5). Given that we have two classes, good review or bad review, it makes sense to try a linear hyperplane classifier: an SVM.
from sklearn.svm import SVC
clf2 = SVC( kernel='linear' )
clf2.fit( X_train, y_train )
y_score = clf2.decision_function( X_test )
fpr, tpr, thresholds = roc_curve( y_test.ravel(), y_score.ravel() )
roc_auc = auc( fpr, tpr )
# Plot the ROC curve
plt.clf()
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc )
plt.legend(loc="lower right")
plt.show()
A bit better, although I am not too happy with the 'scientific' method here. A better choice would be a parameter grid search over a dev set defined by cross-validation, but I'll reserve that for the next post. For the moment this should suffice.
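For the curious, such a search could look roughly like this with sklearn's `GridSearchCV` (the synthetic data and the grid values are placeholders, not tuned choices; in the post this would run on `X_train`, `y_train`):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import make_classification

# stand-in data so the snippet is self-contained
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# illustrative parameter grid
param_grid = {'C': [0.1, 1.0, 10.0], 'kernel': ['linear']}

# 5-fold cross-validation over the grid; refits the best model on all data
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```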
Let's take a look at where the classifier fails on a random data sample.
y_pred = clf2.predict( X_test )
y_pred[ 0:10 ]
y_test[ 0:10 ]
test.iloc[ 1, 3 ]
Hmm - lots of double negations? Perhaps this is where bag-of-words fails?
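To see why negation trips up bag-of-words, consider this small illustration (the sentences are invented, not taken from the dataset): under unigram counts, a negated sentence and its plain counterpart look very similar, because the model only sees which words occur, not their order.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# 'not bad' reads positive, 'bad' reads negative - but they share most words
docs = ['the movie was not bad at all', 'the movie was bad']
vec = CountVectorizer()
X = vec.fit_transform(docs).toarray()

# cosine similarity between the two count vectors
sim = X[0] @ X[1] / (np.linalg.norm(X[0]) * np.linalg.norm(X[1]))
print(round(sim, 2))  # high similarity despite opposite sentiment
```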