Buffalo Data Science Talk

I recently gave a talk at a Buffalo Data Science Meetup on Text Analytics in Python. It’s adapted from my post on Feature Extraction from Text with some added material and an example.

Intro to Text Analytics in Python

  • Terminology
  • Bag of Words Model
  • TF-IDF Model
  • Preprocessing and Hyperparameters
  • Example
  • N-gram model

Terminology

  • Document - a single string of text information

  • Corpus - a collection of documents

  • Token - a word, phrase or symbol derived from a document

  • Tokenizer - function to split a document into a list of tokens

# Example corpus
messages = ["Hey hey hey lets go get lunch today :)",
           "Did you go home?",
           "Hey!!! I need a favor"]
# Example document
document = messages[0]
document
'Hey hey hey lets go get lunch today :)'
# Creating tokens
document.split(' ')
['Hey', 'hey', 'hey', 'lets', 'go', 'get', 'lunch', 'today', ':)']
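The split(' ') above is the crudest possible tokenizer; punctuation stays glued to the tokens and ':)' comes through as-is. For real work a dedicated tokenizer is usually a better choice. A minimal sketch using nltk's word_tokenize (this assumes nltk and its 'punkt' data are installed; nltk shows up again later for stemming):

# Sketch: a sturdier tokenizer than str.split (assumes nltk's 'punkt' data is downloaded)
from nltk.tokenize import word_tokenize

word_tokenize("Hey!!! I need a favor")
# punctuation comes back as separate tokens, roughly:
# ['Hey', '!', '!', '!', 'I', 'need', 'a', 'favor']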

Bag of Words Model

  • need a numerical representation for our corpus
  • will use CountVectorizer() from the scikit-learn library
  • creates matrix of token counts
# import and instantiate CountVectorizer()
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
  • next we will use fit() and transform() methods
  • similar to fit() and predict() used in ML classifiers
vect.fit(messages)
CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
# before transforming look at feature names (column names)
print vect.get_feature_names()
print 'Number of tokens: {}'.format(len(vect.get_feature_names()))
[u'did', u'favor', u'get', u'go', u'hey', u'home', u'lets', u'lunch', u'need', u'today', u'you']
Number of tokens: 11

Things to note (these all come from CountVectorizer’s defaults; see the sketch after the list):

  • all lowercase
  • tokens shorter than two characters are excluded
  • punctuation removed
  • no duplicates
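A small sketch of overriding those defaults (the vect_keep_case name is just for illustration):

# Sketch: overriding the defaults noted above (illustrative only)
vect_keep_case = CountVectorizer(lowercase=False,              # keep original casing
                                 token_pattern=r'(?u)\b\w+\b') # keep one-letter tokens
vect_keep_case.fit(messages)
print(vect_keep_case.get_feature_names())
# 'Hey' and 'hey' would now be separate features, and 'I' / 'a' would be kept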

Next, we’ll use the transform() method to create a document-term matrix (DTM). This is the matrix of token counts we want.

dtm = vect.transform(messages)
repr(dtm)
"<3x11 sparse matrix of type '<type 'numpy.int64'>'\n\twith 13 stored elements in Compressed Sparse Row format>"
print dtm
  (0, 2)	1
  (0, 3)	1
  (0, 4)	3
  (0, 6)	1
  (0, 7)	1
  (0, 9)	1
  (1, 0)	1
  (1, 3)	1
  (1, 5)	1
  (1, 10)	1
  (2, 1)	1
  (2, 4)	1
  (2, 8)	1
  • Because each document gets a column for every word that occurs anywhere in the corpus, the DTM is predominantly filled with 0’s
  • Sparse format can store the DTM in a much smaller amount of memory and can speed up operations
  • a DTM of a large corpus can quickly balloon in size; a quick memory check is sketched below
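A quick way to get a feel for the difference, even on this tiny corpus (a sketch; the gap is negligible here but grows fast with a real corpus):

# Sketch: comparing dense vs. sparse storage of the DTM
dense = dtm.toarray()
print('dense bytes : {}'.format(dense.nbytes))
print('sparse bytes: {}'.format(dtm.data.nbytes + dtm.indices.nbytes + dtm.indptr.nbytes))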
import pandas as pd
pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())
   did  favor  get  go  hey  home  lets  lunch  need  today  you
0    0      0    1   1    3     0     1      1     0      1    0
1    1      0    0   1    0     1     0      0     0      0    1
2    0      1    0   0    1     0     0      0     1      0    0
# get total counts for corpus
pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names()).sum()
did      1
favor    1
get      1
go       2
hey      4
home     1
lets     1
lunch    1
need     1
today    1
you      1
dtype: int64

What happens if we get a new message?

new_message = ['Hey lets go get a drink tonight']
new_dtm = vect.transform(new_message)
pd.DataFrame(new_dtm.toarray(), columns=vect.get_feature_names())
   did  favor  get  go  hey  home  lets  lunch  need  today  you
0    0      0    1   1    1     0     1      0     0      0    0
  • only tokens from the original fit appear as features (columns)
  • need to refit with the new message included
messages.append(new_message[0])
messages
['Hey hey hey lets go get lunch today :)',
 'Did you go home?',
 'Hey!!! I need a favor',
 'Hey lets go get a drink tonight']
dtm = vect.fit_transform(messages)
pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())
   did  drink  favor  get  go  hey  home  lets  lunch  need  today  tonight  you
0    0      0      0    1   1    3     0     1      1     0      1        0    0
1    1      0      0    0   1    0     1     0      0     0      0        0    1
2    0      0      1    0   0    1     0     0      0     1      0        0    0
3    0      1      0    1   1    1     0     1      0     0      0        1    0

TF-IDF Model

  • term frequency inverse document frequency
  • generally more popular than bag of words model
  • numerical statistic that shows how important a token is to a document
  • TF-IDF = term frequency * inverse document frequency (IDF is usually log-scaled in practice; a rough worked sketch follows this list)
  • TF - how frequently a term (token) occurs in a document
  • IDF - the inverse of how frequently a term occurs across documents
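A rough, back-of-the-envelope version of the formula (a sketch only; scikit-learn's TfidfVectorizer also smooths the IDF and L2-normalizes each row, so the numbers it produces below won't match this exactly):

# Sketch: plain TF-IDF for a single term, with a log-scaled IDF
import math

toy_docs = ["hey lets get lunch", "hey i need a favor"]    # illustrative corpus
term = 'hey'

tf = toy_docs[0].split().count(term)                       # term frequency in document 0
doc_freq = sum(term in doc.split() for doc in toy_docs)    # document frequency
idf = math.log(float(len(toy_docs)) / doc_freq)            # inverse document frequency
print(tf * idf)  # 0.0 -- a term that appears in every document carries no weight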
from sklearn.feature_extraction.text import TfidfVectorizer
def createDTM(messages):
    vect = TfidfVectorizer()
    dtm = vect.fit_transform(messages) # create DTM
    
    # create pandas dataframe of DTM
    return pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names()) 
messages = ["Hey lets get lunch :)",
           "Hey!!! I need a favor"]
createDTM(messages)
      favor       get       hey      lets     lunch      need
0  0.000000  0.534046  0.379978  0.534046  0.534046  0.000000
1  0.631667  0.000000  0.449436  0.000000  0.000000  0.631667
  • 'hey' has the lowest value; it is the only word that occurs in both documents
  • 'favor' and 'need' have the highest values; they occur in only one document, and that document has the fewest tokens
# add repeats of 'hey' to first message
messages = ["Hey hey hey lets get lunch :)",
           "Hey!!! I need a favor"]
createDTM(messages)
      favor       get       hey      lets     lunch      need
0  0.000000  0.363788  0.776515  0.363788  0.363788  0.000000
1  0.631667  0.000000  0.449436  0.000000  0.000000  0.631667
  • TF for 'hey' in first increases, but IDF for 'hey' remains the same
# remove 'hey' from second message
messages = ["Hey hey hey lets get lunch :)",
           "I need a favor"]
createDTM(messages)
      favor       get       hey      lets     lunch      need
0  0.000000  0.288675  0.866025  0.288675  0.288675  0.000000
1  0.707107  0.000000  0.000000  0.000000  0.000000  0.707107
  • 'hey' for first message is now the highest value
  • 'favor' and 'need' also increase as there are now fewer tokens in the second message

Preprocessing and Hyperparameters

  • max_features = n : only considers the top n words when ordered by term frequency
  • min_df = n : ignores words with a document frequency below n (an integer count or, if a float, a proportion of documents)
  • max_df = n : ignores words with a document frequency above n
  • stop_words = [...] : ignores common words like 'the', 'that', 'which', etc.; pass stop_words='english' for scikit-learn's built-in list (a combined sketch of these options follows this list)
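A combined sketch of these knobs (the values and the vect_tuned name are purely illustrative):

# Sketch: several preprocessing hyperparameters at once (values are illustrative)
vect_tuned = CountVectorizer(max_features=500,      # keep only the 500 most frequent terms
                             min_df=2,              # drop terms appearing in fewer than 2 documents
                             max_df=0.8,            # drop terms appearing in more than 80% of documents
                             stop_words='english')  # built-in English stop word list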
vect = CountVectorizer(stop_words='english')
print vect.get_stop_words()
frozenset(['all', 'six', 'less', 'being', 'indeed', 'over', 'move', 'anyway', 'four', 'not', 'own', 'through', 'yourselves', 'fify', 'where', 'mill', 'only', 'find', 'before', 'one', 'whose', 'system', 'how', 'somewhere', 'with', 'thick', 'show', 'had', 'enough', 'should', 'to', 'must', 'whom', 'seeming', 'under', 'ours', 'has', 'might', 'thereafter', 'latterly', 'do', 'them', 'his', 'around', 'than', 'get', 'very', 'de', 'none', 'cannot', 'every', 'whether', 'they', 'front', 'during', 'thus', 'now', 'him', 'nor', 'name', 'several', 'hereafter', 'always', 'who', 'cry', 'whither', 'this', 'someone', 'either', 'each', 'become', 'thereupon', 'sometime', 'side', 'two', 'therein', 'twelve', 'because', 'often', 'ten', 'our', 'eg', 'some', 'back', 'up', 'go', 'namely', 'towards', 'are', 'further', 'beyond', 'ourselves', 'yet', 'out', 'even', 'will', 'what', 'still', 'for', 'bottom', 'mine', 'since', 'please', 'forty', 'per', 'its', 'everything', 'behind', 'un', 'above', 'between', 'it', 'neither', 'seemed', 'ever', 'across', 'she', 'somehow', 'be', 'we', 'full', 'never', 'sixty', 'however', 'here', 'otherwise', 'were', 'whereupon', 'nowhere', 'although', 'found', 'alone', 're', 'along', 'fifteen', 'by', 'both', 'about', 'last', 'would', 'anything', 'via', 'many', 'could', 'thence', 'put', 'against', 'keep', 'etc', 'amount', 'became', 'ltd', 'hence', 'onto', 'or', 'con', 'among', 'already', 'co', 'afterwards', 'formerly', 'within', 'seems', 'into', 'others', 'while', 'whatever', 'except', 'down', 'hers', 'everyone', 'done', 'least', 'another', 'whoever', 'moreover', 'couldnt', 'throughout', 'anyhow', 'yourself', 'three', 'from', 'her', 'few', 'together', 'top', 'there', 'due', 'been', 'next', 'anyone', 'eleven', 'much', 'call', 'therefore', 'interest', 'then', 'thru', 'themselves', 'hundred', 'was', 'sincere', 'empty', 'more', 'himself', 'elsewhere', 'mostly', 'on', 'fire', 'am', 'becoming', 'hereby', 'amongst', 'else', 'part', 'everywhere', 'too', 'herself', 'former', 'those', 'he', 'me', 'myself', 'made', 'twenty', 'these', 'bill', 'cant', 'us', 'until', 'besides', 'nevertheless', 'below', 'anywhere', 'nine', 'can', 'of', 'toward', 'my', 'something', 'and', 'whereafter', 'whenever', 'give', 'almost', 'wherever', 'is', 'describe', 'beforehand', 'herein', 'an', 'as', 'itself', 'at', 'have', 'in', 'seem', 'whence', 'ie', 'any', 'fill', 'again', 'hasnt', 'inc', 'thereby', 'thin', 'no', 'perhaps', 'latter', 'meanwhile', 'when', 'detail', 'same', 'wherein', 'beside', 'also', 'that', 'other', 'take', 'which', 'becomes', 'you', 'if', 'nobody', 'see', 'though', 'may', 'after', 'upon', 'most', 'hereupon', 'eight', 'but', 'serious', 'nothing', 'such', 'your', 'why', 'a', 'off', 'whereby', 'third', 'i', 'whole', 'noone', 'sometimes', 'well', 'amoungst', 'yours', 'their', 'rather', 'without', 'so', 'five', 'the', 'first', 'whereas', 'once'])
# defining our own stopwords
my_words = ['buffalo','data','science']
vect = CountVectorizer(stop_words=my_words)
print vect.get_stop_words()
frozenset(['data', 'buffalo', 'science'])

Word Stemming

  • reduces a word down to its base/root form
  • crude heuristic that works by chopping off the end of a word
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
tokens = ['manufactured','manufacturing','manufacture']
stems = [stemmer.stem(i) for i in tokens]
print stems
[u'manufactur', u'manufactur', u'manufactur']

Word Lemmatization

  • similar to stemming
  • seeks to find the base dictionary form (lemma)
  • more complex, may need to specify part of speech for accurate results
from nltk import WordNetLemmatizer
lemmer = WordNetLemmatizer()
tokens = ['hands','women']
lemmas = [lemmer.lemmatize(i) for i in tokens]
print lemmas
[u'hand', u'woman']
lemmer.lemmatize('manufacturing')
'manufacturing'
# specify it as a verb, default is noun
lemmer.lemmatize('manufacturing','v')
u'manufacture'
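One hedged way to wire this into the vectorizers from earlier is to pass a custom tokenizer that stems each token (the stemming_tokenizer helper and its regex are my own illustration, not part of scikit-learn):

# Sketch: plugging the Porter stemmer into CountVectorizer via a custom tokenizer
import re

def stemming_tokenizer(doc):
    # same default token pattern CountVectorizer uses, then stem each token
    return [stemmer.stem(t) for t in re.findall(r'(?u)\b\w\w+\b', doc)]

stem_vect = CountVectorizer(tokenizer=stemming_tokenizer)
# 'manufactured', 'manufacturing', 'manufacture' would now collapse into one feature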

Example

  • dataset of song lyrics from 4 different artists (Beatles, Metallica, Eminem, Bob Dylan)
  • we will use a vectorizer and then try to plot the songs
  • would expect similar songs to be close together
df = pd.read_csv('lyrics.txt', sep='\t')
df
artist song lyrics
0 Beatles Help! (When) When I was younger (When I was young) s...
1 Beatles Ticket to Ride I think I'm gonna be sad, I think it's today, ...
2 Beatles A Hard Days Night It's been a hard day's night, and I been worki...
3 Beatles Cant Buy Me Love Can't buy me love, love Can't buy me love I'll...
4 Beatles Eleanor Rigby Ah look at all the lonely people Ah look at al...
5 Beatles I Want to Hold Your Hand Oh yeah, I'll tell you something I think you'l...
6 Beatles She Loves You She loves you, yeah, yeah, yeah She loves you,...
7 Beatles Yesterday Yesterday all my troubles seemed so far away. ...
8 Metallica Nothing Else Matters So close no matter how far Couldn't be much mo...
9 Metallica Enter Sandman Say your prayers, little one Don't forget, my ...
10 Metallica Master of Puppets End of passion play, crumbling away I’m your s...
11 Metallica The Unforgiven New blood joins this earth, And quickly he's s...
12 Metallica Fade to Black Life, it seems, will fade away Drifting furthe...
13 Metallica One I can’t remember anything Can’t tell if this i...
14 Metallica For Whom the Bell Tolls Make his fight on the hill in the early day Co...
15 Eminem The Real Slim Shady May I have your attention please? May I have y...
16 Eminem Till I Collapse 'Cause sometimes you just feel tired, Feel wea...
17 Eminem Lose Yourself Look, if you had, one shot, or one opportunity...
18 Eminem Stan My tea's gone cold I'm wondering why I got out...
19 Eminem My Name Is Hi! My name is... (what?) My name is... (who?)...
20 Eminem Like Toy Soldiers Step by step, heart to heart, left right left ...
21 Eminem When I'm Gone Yeah... It's my life... My own words I guess.....
22 Eminem Mockingbird Yeah I know sometimes things may not always ma...
23 Eminem Without Me Obie Trice/Real Name No Gimmicks [2x] two trai...
24 Bob Dylan Blowin in the Wind How many roads must a man walk down Before you...
25 Bob Dylan Mr Tambourin Man Hey ! Mr Tambourine Man, play a song for me I'...
26 Bob Dylan Its All Over Now Baby Blue You must leave now, take what you need, you th...
27 Bob Dylan The Times They are A-changin Come gather 'round people Wherever you roam An...
28 Bob Dylan Hurricane Pistols shots ring out in the barroom night En...
29 Bob Dylan It aint me babe Go 'way from my window Leave at your own chose...
30 Bob Dylan Maggies Farm I ain't gonna work on Maggie's farm no more No...
31 Bob Dylan A Hard Rains A-gonna Fall Oh, where have you been, my blue-eyed son? And...
vect = TfidfVectorizer(stop_words='english',max_df=0.7)
dtm = vect.fit_transform(df['lyrics'])
repr(dtm)
"<32x1984 sparse matrix of type '<type 'numpy.float64'>'\n\twith 3471 stored elements in Compressed Sparse Row format>"
  • we can’t plot 1984 dimensions in an effective way
  • need to reduce dimensionality to 2 dimensions
  • use Principal Component Analysis (PCA)
  • describes the data using a smaller number of dimensions
  • tries to retain the variance and ‘structure’ of the data
# Principal Component Analysis (PCA) to reduce down to two dimensions
from sklearn.decomposition import PCA
X_pca = PCA(n_components=2).fit_transform(dtm.toarray())
df['A'] = X_pca[:,0]
df['B'] = X_pca[:,1]
df.head()
artist song lyrics A B
0 Beatles Help! (When) When I was younger (When I was young) s... -0.204841 0.018931
1 Beatles Ticket to Ride I think I'm gonna be sad, I think it's today, ... 0.107004 -0.258968
2 Beatles A Hard Days Night It's been a hard day's night, and I been worki... 0.044426 -0.247208
3 Beatles Cant Buy Me Love Can't buy me love, love Can't buy me love I'll... 0.085107 -0.354033
4 Beatles Eleanor Rigby Ah look at all the lonely people Ah look at al... -0.232241 0.069967
import seaborn as sns
import matplotlib.pyplot as plt
sns.lmplot(x='A', y='B', data=df,fit_reg=False, hue='artist')
plt.show()

[Figure: 2-D PCA projection of the song lyrics, one point per song, colored by artist]

If we had used CountVectorizer instead of TfidfVectorizer: [Figure: the same 2-D PCA scatter plot, built from raw counts, colored by artist]
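For reference, that count-based version is the same pipeline with the vectorizer swapped out (a sketch; the count_ prefixed names are just to keep it separate from the TF-IDF run above):

# Sketch: same pipeline with raw counts instead of TF-IDF weights
count_vect = CountVectorizer(stop_words='english', max_df=0.7)
count_dtm = count_vect.fit_transform(df['lyrics'])
X_pca_counts = PCA(n_components=2).fit_transform(count_dtm.toarray())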

N-gram Model

  • n-gram is a sequence of n words
  • bag of words model is actually a specific case of the N-gram model where n=1
  • Consider the string 'Buffalo Data Science Meetup'
  • n=1 (unigram) : 'Buffalo','Data','Science','Meetup' (Bag of words model)
  • n=2 (bigram) : 'Buffalo Data', 'Data Science', 'Science Meetup'
  • n=3 (trigram) : 'Buffalo Data Science', 'Data Science Meetup'
  • using an n-gram model retains some information about the order of tokens
messages = ["Hey hey hey lets go get lunch today :)",
           "Hey!!! I need a favor"]
# look at bigrams
vect = CountVectorizer(ngram_range=(2,2))
dtm = vect.fit_transform(messages)
pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())
   get lunch  go get  hey hey  hey lets  hey need  lets go  lunch today  need favor
0          1       1        2         1         0        1            1           0
1          0       0        0         0         1        0            0           1
# look at trigrams
vect = CountVectorizer(ngram_range=(3,3))
dtm = vect.fit_transform(messages)
pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())
   get lunch today  go get lunch  hey hey hey  hey hey lets  hey lets go  hey need favor  lets go get
0                1             1            1             1            1               0            1
1                0             0            0             0            0               1            0
# looking at unigrams, bigrams, and trigrams
vect = CountVectorizer(ngram_range=(1,3))
dtm = vect.fit_transform(messages)
pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())
favor get get lunch get lunch today go go get go get lunch hey hey hey hey hey hey ... hey need hey need favor lets lets go lets go get lunch lunch today need need favor today
0 0 1 1 1 1 1 1 3 2 1 ... 0 0 1 1 1 1 1 0 0 1
1 1 0 0 0 0 0 0 1 0 0 ... 1 1 0 0 0 0 0 1 1 0

2 rows × 23 columns

# Can also use with tf-idf
vect = TfidfVectorizer(ngram_range=(2,2))
dtm = vect.fit_transform(messages)
pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())
   get lunch    go get   hey hey  hey lets  hey need   lets go  lunch today  need favor
0   0.333333  0.333333  0.666667  0.333333  0.000000  0.333333     0.333333    0.000000
1   0.000000  0.000000  0.000000  0.000000  0.707107  0.000000     0.000000    0.707107

Questions?