Torch And Linear Regression On GPU

September 10, 2015

There are a few packages that can help with implementing RNNs and their need for high-performance computation. I like Caffe the most, but it can be challenging, especially when it comes to adding new code, as you need to deal with C++ and CUDA. There is also Theano, but I am not a great fan of its heavy computational tree optimisation, especially during the evaluation stage. There is also Torch, based on Lua, which is ... well, I don't know what Torch can do at this stage ...

Hence this post is about implementing linear regression using CUDA tensors and Torch 7. The example below is loosely based on the Torch 7 and iTorch demos.

In [2]:
require 'cutorch';
require 'cunn';
require 'optim';

torch.setdefaulttensortype( 'torch.FloatTensor' )
logger = optim.Logger( paths.concat('.', 'train.log') )

For this exercise we will use a fairly large table:

In [3]:
x_len = 1000000
x_width = 2

X = torch.CudaTensor( x_len, x_width ):normal()
A = torch.CudaTensor{ {1}, {2} }
Y = torch.mm( X, A ) + torch.CudaTensor( x_len, 1 ):normal( 3.0, 1.0 )

Let's define a linear layer to express our regression. The nn package will take care of gradient derivation as well as the forward and backward passes.

In [4]:
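lin_layer = nn.Linear( (#X)[2], (#Y)[2] )
model = nn.Sequential()
model:add( lin_layer )
-- move the model and the criterion to the GPU, since the data are CudaTensors
model:cuda()
criterion = nn.MSECriterion()
criterion:cuda()
params, dl_dparams = model:getParameters()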
In [5]:
sgd_params = {
    learningRate = 1e-3,
    learningRateDecay = 1e-4,
    weightDecay = 0,
    momentum = 0
}
epochs = 100
batch_size = 50000
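
To get a feel for what `learningRateDecay` does over the whole run, here is a small Python sketch of the step-size schedule. It assumes that `optim.sgd` scales the rate as `learningRate / (1 + n_evals * learningRateDecay)`, where `n_evals` counts calls to the optimiser; the function name is mine and purely illustrative.

```python
# Sketch of the step-size schedule. Assumption: optim.sgd uses
# learningRate / (1 + n_evals * learningRateDecay).
def effective_learning_rate(learning_rate, learning_rate_decay, n_evals):
    return learning_rate / (1.0 + n_evals * learning_rate_decay)

learning_rate = 1e-3
learning_rate_decay = 1e-4
batches_per_epoch = 1000000 // 50000   # x_len / batch_size = 20 updates per epoch

for epoch in (0, 10, 50, 100):
    n_evals = epoch * batches_per_epoch
    print(epoch, effective_learning_rate(learning_rate, learning_rate_decay, n_evals))
```

Under that assumption the rate only drops by a factor of 1.2 over 100 epochs, so the decay is very mild with these settings.
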
In [6]:
function train( X, Y )
    local current_loss = 0

    -- mini input / target
    local inputs = torch.CudaTensor( batch_size, x_width )
    local targets = torch.CudaTensor( batch_size )
    -- we won't shuffle here, as a for loop is too slow in Lua;
    -- instead we start from a random offset
    local offset = math.floor( torch.uniform( 0, batch_size-1 ) )
    -- for each mini batch
    for t = 1,(#X)[1], batch_size do

        local x_start = t + offset
        local x_end = math.min( t + offset + batch_size - 1, (#X)[1] )
        -- the last batch may be shorter; the remaining rows keep the previous batch
        local n = x_end - x_start + 1
        inputs[ { {1, n} } ] = X[ { {x_start, x_end} } ]:clone()
        targets[ { {1, n} } ] = Y[ { {x_start, x_end} } ]:clone()
        -- eval function to minimise
        feval = function( params_new )
            -- clean up
            collectgarbage()

            if params ~= params_new then
                params:copy( params_new )
            end

            -- reset gradients (gradients are always accumulated, to accommodate batch methods)
            dl_dparams:zero()

            -- evaluate the loss function and its derivative wrt the parameters
            local outputs = model:forward( inputs )
            local loss = criterion:forward( outputs, targets )
            local backprop = criterion:backward( outputs, targets )
            model:backward( inputs, backprop )

            -- return loss and dloss/dparams
            return loss, dl_dparams
        end

        -- run SGD
        _, fs = optim.sgd( feval, params, sgd_params )
        current_loss = current_loss + fs[1]
    end
    current_loss = current_loss / batch_size
    logger:add{['training_error'] = current_loss }
    return current_loss
end
In [7]:
time = sys.clock()
local cumm_loss = 0.
for i = 1, epochs do
    cumm_loss = train( X, Y )
end

print( 'Final loss = ' .. cumm_loss )

-- time taken
time = sys.clock() - time
print( "Time per epoch = " .. (time / epochs) .. '[s]')
Final loss = 0.00040390299797058   
Time per epoch = 0.15690538883209[s]    
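
As a sanity check on the training loop above, here is a rough NumPy sketch of the same idea (it is not the Torch code): mini-batch SGD on the MSE loss, starting each epoch from a random offset, compared with a closed-form least-squares fit. The function name, the smaller data set, the step size of 0.1 and the choice to skip an incomplete final batch are my assumptions for illustration.

```python
import numpy as np

def sgd_linear_regression(X, Y, batch_size=1000, epochs=20, lr=0.1, seed=0):
    """Mini-batch SGD on the MSE loss of Y ~ X w + b (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros((d, 1))
    b = 0.0
    for _ in range(epochs):
        # no shuffling: start from a random offset and use full batches only
        offset = int(rng.integers(0, batch_size))
        for start in range(offset, n - batch_size + 1, batch_size):
            xb = X[start:start + batch_size]
            yb = Y[start:start + batch_size]
            err = xb @ w + b - yb                    # residuals, shape (batch, 1)
            w -= lr * 2.0 * (xb.T @ err) / len(xb)   # gradient of the MSE wrt w
            b -= lr * 2.0 * err.mean()               # gradient of the MSE wrt b
    return np.append(w.ravel(), b)

rng = np.random.default_rng(42)
X = rng.normal(size=(20000, 2))
A = np.array([[1.0], [2.0]])
Y = X @ A + rng.normal(3.0, 1.0, size=(20000, 1))

theta_sgd = sgd_linear_regression(X, Y)

# closed-form least squares, with a column of ones for the bias
X1 = np.hstack([X, np.ones((len(X), 1))])
theta_ls = np.linalg.lstsq(X1, Y, rcond=None)[0].ravel()

print(theta_sgd)   # both should be close to 1, 2 (weights) and 3 (bias)
print(theta_ls)
```

The noise has a mean of 3, so it is absorbed by the bias term, while its unit variance sets a floor on the achievable MSE.
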

Let's take a look at the recovered parameters. The two weights should be close to the entries of matrix A (1 and 2), and the bias should be close to the mean of the noise (3):

In [8]:
print( params )

[torch.CudaTensor of size 3]

Not bad. Here's the chart of MSE as a function of epoch

In [9]:
Plot = require 'itorch.Plot'

for name, list in pairs( logger.symbols ) do
    y = torch.Tensor( list )
    x = torch.linspace( 1, #list, #list )
    plot = Plot():line( x, y ,'blue', name ):legend(true):title('MSE'):draw()
end

Word Vectors

September 05, 2015

Word vectors are a cool idea: pack the information about a word into an $R^n$ vector. The difference from the bag-of-words method is that $n$ is smaller than the size of the dictionary. Additionally, as a side effect of the optimisation, word vectors have very interesting features.

Word vectors are the result of maximising the probability of finding a word in a specific context:

$$ J( \theta ) = \frac{1}{T} \sum_{i=1}^T \sum_{-c\le j\le c, j\ne 0} \log p( w_{i+j} \mid w_i ) $$
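
To make the double sum concrete, here is a minimal Python sketch that enumerates the (centre, context) word pairs the objective runs over, i.e. every position $i$ and every offset $-c \le j \le c$, $j \ne 0$ that stays inside the sentence. The function name and the toy sentence are made up for illustration.

```python
def skipgram_pairs(words, c):
    """Return (centre, context) pairs for each position i and offset -c <= j <= c, j != 0."""
    pairs = []
    for i, centre in enumerate(words):
        for j in range(-c, c + 1):
            # skip the centre word itself and offsets that fall outside the sentence
            if j != 0 and 0 <= i + j < len(words):
                pairs.append((centre, words[i + j]))
    return pairs

sentence = "the movie was surprisingly good".split()
pairs = skipgram_pairs(sentence, c=1)
print(pairs)
```

Each pair contributes one $\log p( w_{i+j} \mid w_i )$ term to the sum.
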

The probability function $ p( w_{i+j} \mid w_i ) $ can take many forms; however, the original paper uses a form of softmax [1]:

$$ p( w_{i+j} \mid w_i ) = \frac{ e^{ v_{w_{i+j}}^T v_{w_i} } }{ \sum_{w=1}^W e^{ v_w^T v_{w_i} } } $$

There are a couple of problems when searching for the parameters of the above function: computing the denominator is expensive, and the resulting matrix can be very large, depending on the number of words in the dictionary. For this reason we will use the gensim package, which wraps the search algorithm [2].
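
To see why the denominator hurts, here is a small NumPy sketch of the softmax above on a toy vocabulary. It assumes one vector per word, as in the formula in this post (word2vec itself keeps separate input and output vectors); the function name, the sizes and the random vectors are purely illustrative.

```python
import numpy as np

def softmax_probability(V, centre, context):
    """p(context | centre), where the rows of V are the word vectors."""
    scores = V @ V[centre]            # one dot product per vocabulary word
    scores = scores - scores.max()    # stabilise the exponentials
    exp_scores = np.exp(scores)
    return exp_scores[context] / exp_scores.sum()

rng = np.random.default_rng(0)
W, n = 1000, 50                       # toy vocabulary size and vector dimension
V = rng.normal(scale=0.1, size=(W, n))

probs = np.array([softmax_probability(V, 3, w) for w in range(W)])
print(probs.sum())                    # the probabilities over the vocabulary sum to one
```

Every single probability needs a dot product with all $W$ word vectors, which is what makes the exact softmax costly for a realistic dictionary.
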

This code is partially based on Kaggle's Bag of Words Meets Bags of Popcorn tutorial [3].

1 - Efficient Estimation of Word Representations in Vector Space
2 - gensim: Topic modelling for humans - Radim Řehůřek
3 - Word Vectors

In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import gensim
import logging
import re

from bs4 import BeautifulSoup
from matplotlib import pyplot as plt
from nltk.corpus import stopwords
In [2]:
# load data
df = pd.read_csv( 'labeledTrainData.tsv', header=0, delimiter='\t', quoting=3 )
df_unlabelled = pd.read_csv( 'unlabeledTrainData.tsv', header=0, delimiter='\t', quoting=3 )

# train test / ratio of 0.66
tt_index = np.random.binomial( 1, 0.66, size=df.shape[0] )
train = df[ tt_index == 1 ]
test = df[ tt_index == 0 ]

print( 'train shape: {0}'.format( train.shape ) )
print( 'test shape: {0}'.format( test.shape ) )
print( 'unlabelled shape: {0}'.format( df_unlabelled.shape ) )
train shape: (16554, 3)
test shape: (8446, 3)
unlabelled shape: (50000, 2)
In [3]:
# borrowed from Kaggle
def review_to_wordlist( review, remove_stopwords=False ):
    # remove HTML
    review_text = BeautifulSoup(review).get_text()
    review_text = re.sub("[^a-zA-Z]"," ", review_text)
    words = review_text.lower().split()
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if not w in stops]
    return ( words )

# Define a function to split a review into parsed sentences
def review_to_sentences( review, tokenizer, remove_stopwords=False ):
    raw_sentences = tokenizer.tokenize( review.strip().decode('utf8', 'ignore') )
    sentences = []
    for raw_sentence in raw_sentences:
        if len(raw_sentence) > 0:
            sentences.append( review_to_wordlist( raw_sentence, remove_stopwords ))
    return sentences
In [4]:
# Load the punkt tokenizer
import nltk.data
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

sentences = []
for review in train[ 'review' ]:
    sentences += review_to_sentences(review, tokenizer)

for review in df_unlabelled[ 'review' ]:
    sentences += review_to_sentences(review, tokenizer)
In [5]:
logging.basicConfig( format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO )

num_features = 300    # dimensionality                      
min_word_count = 40   # minimum word count                        
num_workers = 6       # number of threads to run in parallel
context = 10          # context window size                                                                                    
downsampling = 1e-3   # downsample setting for frequent words

model = gensim.models.Word2Vec(sentences, workers=num_workers, \
            size=num_features, min_count = min_word_count, \
            window = context, sample = downsampling)

# free memory
model.init_sims( replace=True )

# save model
model_name = "300features_40minwords_10context"
model.save( model_name )

# stat
print( 'total run time: {0} [s]'.format( model.total_train_time ) )
total run time: 492.933614016 [s]
In [6]:
nan_words = {}

def makeFeatureVec( words, model, num_features, index2word_set ):
    featureVec = np.zeros((num_features,),dtype="float32")
    nwords = 0.

    for word in words:
        if word in index2word_set: 
            nwords = nwords + 1.
            # keep track of words whose vectors contain NaNs
            if np.isnan( model[ word ] ).any():
                if word in nan_words:
                    nan_words[ word ] += 1
                else:
                    nan_words[ word ] = 1
            featureVec = np.add(featureVec,model[word])
    if nwords != 0:
        featureVec = np.divide(featureVec,nwords)

    return featureVec

def getAvgFeatureVecs(reviews, model, num_features, index2word_set ):
    # integer counter, since it is used as a row index
    counter = 0
    reviewFeatureVecs = np.zeros((len(reviews),num_features),dtype="float32")

    for review in reviews:
        # report progress every 1000 reviews
        if counter % 1000 == 0:
            print( "Review %d of %d" % (counter, len(reviews)) )
        reviewFeatureVecs[counter] = makeFeatureVec(review, model, num_features, index2word_set )
        counter = counter + 1

    return reviewFeatureVecs
In [7]:
index2word_set = set( model.index2word )

clean_train_reviews = []
for review in train[ 'review' ]:
    clean_train_reviews.append( review_to_wordlist( review, remove_stopwords=True ) )

trainDataVecs = getAvgFeatureVecs( clean_train_reviews, model, num_features, index2word_set )

clean_test_reviews = []
for review in test[ 'review' ]:
    clean_test_reviews.append( review_to_wordlist( review, remove_stopwords=True ) )

testDataVecs = getAvgFeatureVecs( clean_test_reviews, model, num_features, index2word_set )
In [16]:
from sklearn.grid_search import GridSearchCV
import xgboost as xgb

x_params = { 'max_depth': [ 4, 8, 12 ],
             'n_estimators': [ 200, 500, 1000 ],
             'objective': [ 'binary:logistic' ] }

xgb_model = xgb.XGBClassifier()

clf = GridSearchCV(xgb_model, x_params, verbose=1, n_jobs=1)
clf.fit( trainDataVecs, train[ 'sentiment' ] )

# best parameter combination found by the grid search
print( clf.best_params_ )
{'n_estimators': 1000, 'objective': 'binary:logistic', 'max_depth': 8}
In [17]:
score = clf.score( testDataVecs, test[ 'sentiment' ] )
print( 'score: {0}'.format( score ) )
score: 0.863840871418
In [37]:
from sklearn.metrics import roc_curve, auc

proba = clf.predict_proba( testDataVecs )[ :, 1 ]
fpr, tpr, thresholds = roc_curve( test[ 'sentiment' ].ravel(), proba.ravel() )
roc_auc = auc( fpr, tpr )
In [38]:
# Plot ROC curve
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])

plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc )
plt.legend(loc="lower right")

OK, AUC is 94% for word vectors vs 90% for simple bag of words. Not bad ... but let's take a look at a few data examples to see what's going on under the hood.

In [59]:
diffs = np.where( clf.predict( testDataVecs ) != test[ 'sentiment' ] )[0]

# print 1st few differences
print( diffs[ 1:3 ] )
[14 19]
In [57]:
BeautifulSoup( test.iloc[ 14, 2 ] ).get_text()
u'"Okay, sorry, but I loved this movie. I just love the whole 80\'s genre of these kind of movies, because you don\'t see many like this one anymore! I want to ask all of you people who say this movie is just a rip-off, or a cheesy imitation, what is it imitating? I\'ve never seen another movie like this one, well, not horror anyway.Basically its about the popular group in school, who like to make everyones lives living hell, so they decided to pick on this nerdy boy named Marty. It turns fatal when he really gets hurt from one of their little pranks.So, its like 10 years later, and the group of friends who hurt Marty start getting High School reunion letters. But...they are the only ones receiving them! So they return back to the old school, and one by one get knocked off by.......Yeah you probably know what happens!The only part that disappointed me was the very end. It could have been left off, or thought out better.I think you should give it a try, and try not to be to critical!~*~CupidGrl~*~"'
In [58]:
BeautifulSoup( test.iloc[ 19, 2 ] ).get_text()
u'"A charming boy and his mother move to a middle of nowhere town, cats and death soon follow them. That about sums it up.I\'ll admit that I am a little freaked out by cats after seeing this movie. But in all seriousness in spite of the numerous things that are wrong with this film, and believe me there is plenty of that to go around, it is overall a very enjoyable viewing experience.The characters are more like caricatures here with only their basis instincts to rely on. Fear, greed, pride lust or anger seems to be all that motivate these people. Although it can be argued that that seeming failing, in actuality, serves the telling of the story. The supernatural premise and the fact that it is a Stephen King screenplay(not that I have anything specific against Mr. King) are quite nicely supported by some interesting FX work, makeup and quite suitable music. The absolute gem of this film is without a doubt Alice Krige who plays Mary Brady, the otherworldly mother.King manages to take a simple story of outsider, or people who are a little different(okay - a lot in this case), trying to fit in and twists it into a campy over the top little horror gem that has to be in the collection of any horror fan."'

Well, it is hard to say what is driving these errors. My bet is on double negations again, which simple classification methods will definitely struggle with.
For the next encounter with NLP we will bring heavy machinery: Recurrent Neural Networks and see how they would perform with this task.

Bag Of Words

September 01, 2015

In today's post we will take a look at the NLP classification task. One of the simpler algorithms is Bag-Of-Words. Each word is one-hot encoded, then the words of a document are averaged and put through the classifier.
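
As a toy illustration of that description, here is a minimal Python sketch that averages the one-hot vectors of the words in a document. The helper name, the four-word vocabulary and the example sentence are made up; the notebook below uses TF-IDF weights from scikit-learn rather than plain averages.

```python
def bag_of_words(document, vocabulary):
    """Average of the one-hot vectors of the in-vocabulary words of a document."""
    index = {word: k for k, word in enumerate(vocabulary)}
    vector = [0.0] * len(vocabulary)
    words = [w for w in document.lower().split() if w in index]
    for w in words:
        vector[index[w]] += 1.0       # adding a one-hot vector is a count increment
    if words:
        vector = [v / len(words) for v in vector]
    return vector

vocabulary = ["bad", "good", "movie", "plot"]
print(bag_of_words("Good movie good plot", vocabulary))   # [0.0, 0.5, 0.25, 0.25]
```

The resulting vector has one entry per dictionary word, which is why the real feature matrix further down is so wide and sparse.
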

As a dataset we are going to use movie reviews, which can be downloaded from Kaggle. A word of disclaimer: the code below is partially based on the sklearn tutorial as well as on the very good NLP course CS224d from Stanford University.

In [1]:
%matplotlib inline

import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from bs4 import BeautifulSoup
from matplotlib import pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import roc_curve, auc
In [ ]:
# load data
df = pd.read_csv( 'labeledTrainData.tsv', sep='\t' )

# convert the data
df.insert( 3, 'converted', df.iloc[ :, 2 ].apply( lambda x: BeautifulSoup( x ).get_text() ) )
print( 'available columns: {0}'.format( df.columns ) )

# train test / ratio of 0.66
tt_index = np.random.binomial( 1, 0.66, size=df.shape[0] )
train = df[ tt_index == 1 ]
test = df[ tt_index == 0 ]
In [3]:
vectorizer = TfidfVectorizer( encoding='latin1' )
vectorizer.fit_transform( train.iloc[ :, 3 ] )
<16630x64833 sparse matrix of type '<type 'numpy.float64'>'
    with 2291206 stored elements in Compressed Sparse Row format>
In [4]:
# prepare data
X_train = vectorizer.transform( train.iloc[ :, 3 ] )
y_train = train.iloc[ :, 1 ]
X_test = vectorizer.transform( test.iloc[ :, 3 ] )
y_test = test.iloc[ :, 1 ]
In [5]:
# let's take a look at how the input classes are distributed.
# Having more or less equal frequencies will help predictor training
train.hist( column='sentiment' )
In [6]:
ch2 = SelectKBest(chi2, k=100 )
X_train = ch2.fit_transform( X_train, y_train ).toarray()
X_test = ch2.transform( X_test ).toarray()
In [7]:
# we're going to use a Gradient Boosted Tree classifier.  These methods showed good performance in a few Kaggle competitions
clf = GradientBoostingClassifier( n_estimators=100, learning_rate=1.0, max_depth=5, random_state=0 )
clf.fit( X_train, y_train )
GradientBoostingClassifier(init=None, learning_rate=1.0, loss='deviance',
              max_depth=5, max_features=None, max_leaf_nodes=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              random_state=0, subsample=1.0, verbose=0, warm_start=False)
In [8]:
y_score = clf.decision_function( X_test )
fpr, tpr, thresholds = roc_curve( y_test.ravel(), y_score.ravel() )
roc_auc = auc( fpr, tpr )
In [11]:
# Plot ROC curve
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])

plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc )
plt.legend(loc="lower right")
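
For intuition about what the area under the ROC curve measures, here is a small pure-Python sketch. It computes the AUC as the fraction of (positive, negative) pairs that the scores rank correctly, which should agree with the area returned by `auc( fpr, tpr )`; the function name and the four toy scores are illustrative only.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    correct = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                correct += 1.0
            elif p == q:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(labels, scores))   # 0.75: three of the four pairs are ordered correctly
```

A random scorer gets about 0.5 on this measure and a perfect one gets 1.0.
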

The AUC of the Receiver Operating Characteristic curve is 0.87, which is a decent margin over a random binary classifier. Given that we have two classes, good review or bad review, it makes sense to try a linear hyperplane classifier: SVM.

In [13]:
from sklearn.svm import SVC
clf2 = SVC( kernel='linear' )
clf2.fit( X_train, y_train )

y_score = clf2.decision_function( X_test )
fpr, tpr, thresholds = roc_curve( y_test.ravel(), y_score.ravel() )
roc_auc = auc( fpr, tpr )

# Plot ROC curve
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])

plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc )
plt.legend(loc="lower right")

A bit better, although I am not too happy with the 'scientific' method here. A better choice would be to use a parameter grid search over a dev set defined by cross-validation, but I'll reserve this for the next post. For the moment this should suffice.
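
To sketch what a grid search over cross-validated dev sets looks like, here is a small NumPy-only example. It is not the classifier used in this post: ridge regression on synthetic data stands in for the model, and all function names, the grid and the data are assumptions made for illustration.

```python
import numpy as np

def kfold_indices(n, k, rng):
    """Shuffle 0..n-1 and split it into k (train_idx, dev_idx) pairs."""
    folds = np.array_split(rng.permutation(n), k)
    return [(np.concatenate(folds[:i] + folds[i + 1:]), folds[i]) for i in range(k)]

def ridge_fit(X, y, lam):
    """Closed-form ridge regression weights."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def grid_search_cv(X, y, grid, k=5, seed=0):
    """Return the grid value with the lowest mean dev-set MSE, plus all the scores."""
    rng = np.random.default_rng(seed)
    splits = kfold_indices(len(y), k, rng)
    scores = {}
    for lam in grid:
        errors = []
        for train_idx, dev_idx in splits:
            w = ridge_fit(X[train_idx], y[train_idx], lam)
            errors.append(np.mean((X[dev_idx] @ w - y[dev_idx]) ** 2))
        scores[lam] = float(np.mean(errors))
    return min(scores, key=scores.get), scores

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + rng.normal(scale=0.5, size=200)

best, scores = grid_search_cv(X, y, grid=[0.01, 1.0, 100.0, 10000.0])
print(best, scores)
```

The point is that every parameter value is scored on data it was not fitted on, so the test set stays untouched until the final evaluation.
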

Let's take a look where the classifier fails on a random data sample.

In [14]:
y_pred = clf2.predict( X_test )
In [17]:
y_pred[ 0:10 ]
array([1, 0, 0, 0, 1, 1, 1, 1, 1, 1], dtype=int64)
In [18]:
y_test[ 0:10 ]
1     1
5     1
8     0
11    1
12    1
14    0
18    1
20    1
22    1
23    0
Name: sentiment, dtype: int64
In [19]:
test.iloc[ 1, 3 ]
u'I dont know why people think this is such a bad movie. Its got a pretty good plot, some good action, and the change of location for Harry does not hurt either. Sure some of its offensive and gratuitous but this is not the only movie like that. Eastwood is in good form as Dirty Harry, and I liked Pat Hingle in this movie as the small town cop. If you liked DIRTY HARRY, then you should see this one, its a lot better than THE DEAD POOL. 4/5'

Hmm, lots of double negations? Perhaps this is where bag of words fails?
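
One way to see why negations are a problem: a bag of words throws away word order, so two sentences with opposite meaning can end up with exactly the same representation. A tiny Python sketch, with two made-up sentences:

```python
from collections import Counter

def word_counts(text):
    """Unordered word counts, which is all a bag-of-words model sees."""
    return Counter(text.lower().split())

a = "this is not a bad movie it is good"
b = "this is not a good movie it is bad"
print(word_counts(a) == word_counts(b))   # True: identical bags, opposite sentiment
```

Any classifier fed only these counts has to give both sentences the same label.
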