Buffalo Data Science Talk
I recently gave a talk at a Buffalo Data Science Meetup on Text Analytics in Python. It’s adapted from my post on Feature Extraction from Text with some added material and an example.
Intro to Text Analytics in Python
- Terminology
- Bag of Words Model
- TF-IDF Model
- Preprocessing and Hyperparameters
- Example
- N-gram model
Terminology
- Document - a single string of text information
- Corpus - a collection of documents
- Token - a word, phrase, or symbol derived from a document
- Tokenizer - a function to split a document into a list of tokens
# Example corpus
messages = ["Hey hey hey lets go get lunch today :)",
"Did you go home?",
"Hey!!! I need a favor"]
# Example document
document = messages[0]
document
'Hey hey hey lets go get lunch today :)'
# Creating tokens
document.split(' ')
['Hey', 'hey', 'hey', 'lets', 'go', 'get', 'lunch', 'today', ':)']
Bag of Words Model
- need a numerical representation for our corpus
- will use CountVectorizer() from the scikit-learn library
- creates a matrix of token counts
# import and instantiate CountVectorizer()
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
- next we will use fit() and transform() methods
- similar to fit() and predict() used in ML classifiers
vect.fit(messages)
CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None)
# before transforming, look at the feature names (column names)
print vect.get_feature_names()
print 'Number of tokens: {}'.format(len(vect.get_feature_names()))
[u'did', u'favor', u'get', u'go', u'hey', u'home', u'lets', u'lunch', u'need', u'today', u'you']
Number of tokens: 11
Things to note:
- all lowercase
- words with fewer than two letters are excluded
- punctuation removed
- no duplicates
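These behaviors come from CountVectorizer's defaults (lowercasing and a token pattern that requires two or more word characters). A minimal sketch of overriding them, with purely illustrative settings:
# keep case and single-letter words by overriding the defaults
# (illustrative settings only, not what the rest of this post uses)
vect_raw = CountVectorizer(lowercase=False, token_pattern=r'(?u)\b\w+\b')
vect_raw.fit(messages)
print(vect_raw.get_feature_names())
# 'Hey', 'Did', 'I' and 'a' now survive because case is preserved and the
# looser token pattern keeps single-character words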
Next, we’ll use the transform() method to create a document-term matrix (DTM). This is the matrix of token counts we want to create.
dtm = vect.transform(messages)
repr(dtm)
"<3x11 sparse matrix of type '<type 'numpy.int64'>'\n\twith 13 stored elements in Compressed Sparse Row format>"
print dtm
(0, 2) 1
(0, 3) 1
(0, 4) 3
(0, 6) 1
(0, 7) 1
(0, 9) 1
(1, 0) 1
(1, 3) 1
(1, 5) 1
(1, 10) 1
(2, 1) 1
(2, 4) 1
(2, 8) 1
- Because the DTM has a row for every document and a column for every word that occurs in the corpus, it is predominantly filled with 0’s
- Sparse format can store the DTM in a smaller amount of memory and can speed up operations
- a DTM of a large corpus can quickly balloon in size
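As a rough illustration (exact byte counts depend on your NumPy/SciPy versions), you can compare the memory footprint of the sparse and dense forms:
# compare memory used by the sparse CSR matrix vs. a dense array
# (the gap becomes enormous for large corpora)
sparse_bytes = dtm.data.nbytes + dtm.indices.nbytes + dtm.indptr.nbytes
dense_bytes = dtm.toarray().nbytes
print('sparse: {} bytes, dense: {} bytes'.format(sparse_bytes, dense_bytes))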
import pandas as pd
pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())
| | did | favor | get | go | hey | home | lets | lunch | need | today | you |
|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 1 | 3 | 0 | 1 | 1 | 0 | 1 | 0 |
1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
2 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
# get total counts for corpus
pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names()).sum()
did 1
favor 1
get 1
go 2
hey 4
home 1
lets 1
lunch 1
need 1
today 1
you 1
dtype: int64
What happens if we get a new message?
new_message = ['Hey lets go get a drink tonight']
new_dtm = vect.transform(new_message)
pd.DataFrame(new_dtm.toarray(), columns=vect.get_feature_names())
| | did | favor | get | go | hey | home | lets | lunch | need | today | you |
|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
- only tokens from the original fit appear as features (columns)
- need to refit with new message included
messages.append(new_message[0])
messages
['Hey hey hey lets go get lunch today :)',
'Did you go home?',
'Hey!!! I need a favor',
'Hey lets go get a drink tonight']
dtm = vect.fit_transform(messages)
pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())
| | did | drink | favor | get | go | hey | home | lets | lunch | need | today | tonight | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 1 | 1 | 3 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
2 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
3 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
TF-IDF Model
- term frequency inverse document frequency
- generally more popular than bag of words model
- numerical statistic that shows how important a token is to a document
- TF-IDF = term frequency * inverse document frequency
- TF - how frequently a term (token) occurs in a document
- IDF - the inverse of how frequently a term occurs across documents
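As a reference point, scikit-learn's TfidfVectorizer defaults to a smoothed IDF and L2-normalizes each document vector. A minimal sketch of that calculation done by hand (using CountVectorizer for the raw counts and assuming the default settings):
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["Hey lets get lunch :)", "Hey!!! I need a favor"]

counts = CountVectorizer()
tf = counts.fit_transform(docs).toarray().astype(float)          # raw term counts
doc_freq = (tf > 0).sum(axis=0)                                   # document frequency per token
idf = np.log((1.0 + len(docs)) / (1 + doc_freq)) + 1              # smoothed IDF
tfidf = tf * idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)      # L2-normalize each row

# should match TfidfVectorizer with its default settings
print(np.allclose(tfidf, TfidfVectorizer().fit_transform(docs).toarray()))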
from sklearn.feature_extraction.text import TfidfVectorizer
def createDTM(messages):
    vect = TfidfVectorizer()
    dtm = vect.fit_transform(messages)  # create DTM
    # create pandas dataframe of DTM
    return pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())
messages = ["Hey lets get lunch :)",
"Hey!!! I need a favor"]
createDTM(messages)
| | favor | get | hey | lets | lunch | need |
|---|---|---|---|---|---|---|
0 | 0.000000 | 0.534046 | 0.379978 | 0.534046 | 0.534046 | 0.000000 |
1 | 0.631667 | 0.000000 | 0.449436 | 0.000000 | 0.000000 | 0.631667 |
- 'hey' has the lowest value; it is the only word that occurs in both documents
- 'favor' and 'need' have the highest values; they occur in only one document, the one with the fewest tokens
# add repeats of 'hey' to first message
messages = ["Hey hey hey lets get lunch :)",
"Hey!!! I need a favor"]
createDTM(messages)
| | favor | get | hey | lets | lunch | need |
|---|---|---|---|---|---|---|
0 | 0.000000 | 0.363788 | 0.776515 | 0.363788 | 0.363788 | 0.000000 |
1 | 0.631667 | 0.000000 | 0.449436 | 0.000000 | 0.000000 | 0.631667 |
- TF for 'hey' in the first message increases, but IDF for 'hey' remains the same
# remove 'hey' from second message
messages = ["Hey hey hey lets get lunch :)",
"I need a favor"]
createDTM(messages)
| | favor | get | hey | lets | lunch | need |
|---|---|---|---|---|---|---|
0 | 0.000000 | 0.288675 | 0.866025 | 0.288675 | 0.288675 | 0.000000 |
1 | 0.707107 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.707107 |
- 'hey' in the first message now has the highest value
- 'favor' and 'need' also increase, as there are now fewer tokens in the second message
Preprocessing and Hyperparameters
- max_features = n : only considers the top n words when ordered by term frequency
- min_df = n : ignores words with a document frequency below n
- max_df = n : ignores words with a document frequency above n
- stop_words = [...] : ignores common words like 'the', 'that', 'which', etc.
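A quick sketch of these options in use (the cutoff values here are arbitrary and only for illustration):
# illustrative settings only - tune these for your own corpus
vect_custom = CountVectorizer(max_features=1000,  # keep the 1000 most frequent tokens
                              min_df=2,           # drop tokens in fewer than 2 documents
                              max_df=0.8,         # drop tokens in more than 80% of documents
                              stop_words='english')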
vect = CountVectorizer(stop_words='english')
print vect.get_stop_words()
frozenset(['all', 'six', 'less', 'being', 'indeed', 'over', 'move', 'anyway', 'four', 'not', 'own', 'through', 'yourselves', 'fify', 'where', 'mill', 'only', 'find', 'before', 'one', 'whose', 'system', 'how', 'somewhere', 'with', 'thick', 'show', 'had', 'enough', 'should', 'to', 'must', 'whom', 'seeming', 'under', 'ours', 'has', 'might', 'thereafter', 'latterly', 'do', 'them', 'his', 'around', 'than', 'get', 'very', 'de', 'none', 'cannot', 'every', 'whether', 'they', 'front', 'during', 'thus', 'now', 'him', 'nor', 'name', 'several', 'hereafter', 'always', 'who', 'cry', 'whither', 'this', 'someone', 'either', 'each', 'become', 'thereupon', 'sometime', 'side', 'two', 'therein', 'twelve', 'because', 'often', 'ten', 'our', 'eg', 'some', 'back', 'up', 'go', 'namely', 'towards', 'are', 'further', 'beyond', 'ourselves', 'yet', 'out', 'even', 'will', 'what', 'still', 'for', 'bottom', 'mine', 'since', 'please', 'forty', 'per', 'its', 'everything', 'behind', 'un', 'above', 'between', 'it', 'neither', 'seemed', 'ever', 'across', 'she', 'somehow', 'be', 'we', 'full', 'never', 'sixty', 'however', 'here', 'otherwise', 'were', 'whereupon', 'nowhere', 'although', 'found', 'alone', 're', 'along', 'fifteen', 'by', 'both', 'about', 'last', 'would', 'anything', 'via', 'many', 'could', 'thence', 'put', 'against', 'keep', 'etc', 'amount', 'became', 'ltd', 'hence', 'onto', 'or', 'con', 'among', 'already', 'co', 'afterwards', 'formerly', 'within', 'seems', 'into', 'others', 'while', 'whatever', 'except', 'down', 'hers', 'everyone', 'done', 'least', 'another', 'whoever', 'moreover', 'couldnt', 'throughout', 'anyhow', 'yourself', 'three', 'from', 'her', 'few', 'together', 'top', 'there', 'due', 'been', 'next', 'anyone', 'eleven', 'much', 'call', 'therefore', 'interest', 'then', 'thru', 'themselves', 'hundred', 'was', 'sincere', 'empty', 'more', 'himself', 'elsewhere', 'mostly', 'on', 'fire', 'am', 'becoming', 'hereby', 'amongst', 'else', 'part', 'everywhere', 'too', 'herself', 'former', 'those', 'he', 'me', 'myself', 'made', 'twenty', 'these', 'bill', 'cant', 'us', 'until', 'besides', 'nevertheless', 'below', 'anywhere', 'nine', 'can', 'of', 'toward', 'my', 'something', 'and', 'whereafter', 'whenever', 'give', 'almost', 'wherever', 'is', 'describe', 'beforehand', 'herein', 'an', 'as', 'itself', 'at', 'have', 'in', 'seem', 'whence', 'ie', 'any', 'fill', 'again', 'hasnt', 'inc', 'thereby', 'thin', 'no', 'perhaps', 'latter', 'meanwhile', 'when', 'detail', 'same', 'wherein', 'beside', 'also', 'that', 'other', 'take', 'which', 'becomes', 'you', 'if', 'nobody', 'see', 'though', 'may', 'after', 'upon', 'most', 'hereupon', 'eight', 'but', 'serious', 'nothing', 'such', 'your', 'why', 'a', 'off', 'whereby', 'third', 'i', 'whole', 'noone', 'sometimes', 'well', 'amoungst', 'yours', 'their', 'rather', 'without', 'so', 'five', 'the', 'first', 'whereas', 'once'])
# defining our own stopwords
my_words = ['buffalo','data','science']
vect = CountVectorizer(stop_words=my_words)
print vect.get_stop_words()
frozenset(['data', 'buffalo', 'science'])
Word Stemming
- reduces a word down to its base/root form
- a crude heuristic that works by chopping off the end of a word
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
tokens = ['manufactured','manufacturing','manufacture']
stems = [stemmer.stem(i) for i in tokens]
print stems
[u'manufactur', u'manufactur', u'manufactur']
Word Lemmatization
- similar to stemming
- seeks to find base dictionary form
- more complex, may need to specify part of speech for accurate results
from nltk import WordNetLemmatizer
lemmer = WordNetLemmatizer()
tokens = ['hands','women']
lemmas = [lemmer.lemmatize(i) for i in tokens]
print lemmas
[u'hand', u'woman']
lemmer.lemmatize('manufacturing')
'manufacturing'
# specify it as a verb, default is noun
lemmer.lemmatize('manufacturing','v')
u'manufacture'
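Stemming and lemmatization are not built into CountVectorizer, but you can plug them in through a custom tokenizer. A minimal sketch (stem_tokenizer is an illustrative helper, not a library function):
import re
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()

def stem_tokenizer(doc):
    # mimic CountVectorizer's default token pattern, then stem each token
    tokens = re.findall(r'(?u)\b\w\w+\b', doc.lower())
    return [stemmer.stem(t) for t in tokens]

vect = CountVectorizer(tokenizer=stem_tokenizer)
vect.fit(['manufactured goods', 'manufacturing plants'])
print(vect.get_feature_names())  # stems such as u'manufactur' become the features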
Example
- dataset of song lyrics from 4 different artists (Beatles, Metallica, Eminem, Bob Dylan)
- we will use a vectorizer and then plot the songs in two dimensions
- would expect similar songs to be close together
df = pd.read_csv('lyrics.txt', sep='\t')
df
| | artist | song | lyrics |
|---|---|---|---|
0 | Beatles | Help! | (When) When I was younger (When I was young) s... |
1 | Beatles | Ticket to Ride | I think I'm gonna be sad, I think it's today, ... |
2 | Beatles | A Hard Days Night | It's been a hard day's night, and I been worki... |
3 | Beatles | Cant Buy Me Love | Can't buy me love, love Can't buy me love I'll... |
4 | Beatles | Eleanor Rigby | Ah look at all the lonely people Ah look at al... |
5 | Beatles | I Want to Hold Your Hand | Oh yeah, I'll tell you something I think you'l... |
6 | Beatles | She Loves You | She loves you, yeah, yeah, yeah She loves you,... |
7 | Beatles | Yesterday | Yesterday all my troubles seemed so far away. ... |
8 | Metallica | Nothing Else Matters | So close no matter how far Couldn't be much mo... |
9 | Metallica | Enter Sandman | Say your prayers, little one Don't forget, my ... |
10 | Metallica | Master of Puppets | End of passion play, crumbling away I’m your s... |
11 | Metallica | The Unforgiven | New blood joins this earth, And quickly he's s... |
12 | Metallica | Fade to Black | Life, it seems, will fade away Drifting furthe... |
13 | Metallica | One | I can’t remember anything Can’t tell if this i... |
14 | Metallica | For Whom the Bell Tolls | Make his fight on the hill in the early day Co... |
15 | Eminem | The Real Slim Shady | May I have your attention please? May I have y... |
16 | Eminem | Till I Collapse | 'Cause sometimes you just feel tired, Feel wea... |
17 | Eminem | Lose Yourself | Look, if you had, one shot, or one opportunity... |
18 | Eminem | Stan | My tea's gone cold I'm wondering why I got out... |
19 | Eminem | My Name Is | Hi! My name is... (what?) My name is... (who?)... |
20 | Eminem | Like Toy Soldiers | Step by step, heart to heart, left right left ... |
21 | Eminem | When I'm Gone | Yeah... It's my life... My own words I guess..... |
22 | Eminem | Mockingbird | Yeah I know sometimes things may not always ma... |
23 | Eminem | Without Me | Obie Trice/Real Name No Gimmicks [2x] two trai... |
24 | Bob Dylan | Blowin in the Wind | How many roads must a man walk down Before you... |
25 | Bob Dylan | Mr Tambourin Man | Hey ! Mr Tambourine Man, play a song for me I'... |
26 | Bob Dylan | Its All Over Now Baby Blue | You must leave now, take what you need, you th... |
27 | Bob Dylan | The Times They are A-changin | Come gather 'round people Wherever you roam An... |
28 | Bob Dylan | Hurricane | Pistols shots ring out in the barroom night En... |
29 | Bob Dylan | It aint me babe | Go 'way from my window Leave at your own chose... |
30 | Bob Dylan | Maggies Farm | I ain't gonna work on Maggie's farm no more No... |
31 | Bob Dylan | A Hard Rains A-gonna Fall | Oh, where have you been, my blue-eyed son? And... |
vect = TfidfVectorizer(stop_words='english',max_df=0.7)
dtm = vect.fit_transform(df['lyrics'])
repr(dtm)
"<32x1984 sparse matrix of type '<type 'numpy.float64'>'\n\twith 3471 stored elements in Compressed Sparse Row format>"
- we can’t plot 1984 dimensions in an effective way
- need to reduce dimensionality to 2 dimensions
- use Principal Component Analysis (PCA)
- describes the data using a smaller number of dimensions
- tries to retain the variance and ‘structure’ of the data
# Principal Component Analysis (PCA) to reduce down to two dimensions
from sklearn.decomposition import PCA
X_pca = PCA(n_components=2).fit_transform(dtm.toarray())
df['A'] = X_pca[:,0]
df['B'] = X_pca[:,1]
df.head()
| | artist | song | lyrics | A | B |
|---|---|---|---|---|---|
0 | Beatles | Help! | (When) When I was younger (When I was young) s... | -0.204841 | 0.018931 |
1 | Beatles | Ticket to Ride | I think I'm gonna be sad, I think it's today, ... | 0.107004 | -0.258968 |
2 | Beatles | A Hard Days Night | It's been a hard day's night, and I been worki... | 0.044426 | -0.247208 |
3 | Beatles | Cant Buy Me Love | Can't buy me love, love Can't buy me love I'll... | 0.085107 | -0.354033 |
4 | Beatles | Eleanor Rigby | Ah look at all the lonely people Ah look at al... | -0.232241 | 0.069967 |
import seaborn as sns
import matplotlib.pyplot as plt
sns.lmplot(x='A', y='B', data=df,fit_reg=False, hue='artist')
plt.show()
If we had used CountVectorizer instead of TfidfVectorizer:
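The code for that comparison would look roughly like this, with only the vectorizer swapped out:
# same pipeline as above, only swapping TfidfVectorizer for CountVectorizer
vect = CountVectorizer(stop_words='english', max_df=0.7)
dtm = vect.fit_transform(df['lyrics'])
X_pca = PCA(n_components=2).fit_transform(dtm.toarray())
df['A'] = X_pca[:, 0]
df['B'] = X_pca[:, 1]
sns.lmplot(x='A', y='B', data=df, fit_reg=False, hue='artist')
plt.show()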
N-gram Model
- an n-gram is a sequence of n words
- the bag of words model is actually a special case of the n-gram model where n=1
- Consider the string 'Buffalo Data Science Meetup'
- n=1 (unigram) : 'Buffalo', 'Data', 'Science', 'Meetup' (the bag of words model)
- n=2 (bigram) : 'Buffalo Data', 'Data Science', 'Science Meetup'
- n=3 (trigram) : 'Buffalo Data Science', 'Data Science Meetup'
- the n-gram model retains information about the order of tokens
messages = ["Hey hey hey lets go get lunch today :)",
"Hey!!! I need a favor"]
# look at bigrams
vect = CountVectorizer(ngram_range=(2,2))
dtm = vect.fit_transform(messages)
pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())
| | get lunch | go get | hey hey | hey lets | hey need | lets go | lunch today | need favor |
|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 2 | 1 | 0 | 1 | 1 | 0 |
1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
# look at trigrams
vect = CountVectorizer(ngram_range=(3,3))
dtm = vect.fit_transform(messages)
pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())
| | get lunch today | go get lunch | hey hey hey | hey hey lets | hey lets go | hey need favor | lets go get |
|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
# looking at unigrams, bigrams, and trigrams
vect = CountVectorizer(ngram_range=(1,3))
dtm = vect.fit_transform(messages)
pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())
| | favor | get | get lunch | get lunch today | go | go get | go get lunch | hey | hey hey | hey hey hey | ... | hey need | hey need favor | lets | lets go | lets go get | lunch | lunch today | need | need favor | today |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 2 | 1 | ... | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |
1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
2 rows × 23 columns
# Can also use with tf-idf
vect = TfidfVectorizer(ngram_range=(2,2))
dtm = vect.fit_transform(messages)
pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())
| | get lunch | go get | hey hey | hey lets | hey need | lets go | lunch today | need favor |
|---|---|---|---|---|---|---|---|---|
0 | 0.333333 | 0.333333 | 0.666667 | 0.333333 | 0.000000 | 0.333333 | 0.333333 | 0.000000 |
1 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.707107 | 0.000000 | 0.000000 | 0.707107 |
Questions?