6

I would like to preprocess a corpus of documents using Python in the same way that I can in R. For example, given an initial corpus, `corpus`, I would like to end up with a preprocessed corpus that corresponds to the one produced using the following R code:

library(tm)
library(SnowballC)

corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c("myword", stopwords("english")))
corpus = tm_map(corpus, stemDocument)

Is there a simple or straightforward — preferably pre-built — method of doing this in Python? Is there a way to ensure exactly the same results?


For example, I would like to preprocess

@Apple ear pods are AMAZING! Best sound from in-ear headphones I've ever had!

into

ear pod amaz best sound inear headphon ive ever

  • Use nltk for natural language processing in Python. – ramcdougal Apr 01 '14 at 22:00
  • @ramcdougal: That much I gathered, but I'm struggling with the documentation. – orome Apr 01 '14 at 22:05
  • Check out this [tutorial](http://nbviewer.ipython.org/urls/gist.githubusercontent.com/kljensen/9662971/raw/4628ed3a1d27b84a3c56e46d87146c1d08267893/NewHaven.io+NLP+tutorial.ipynb?create=1). It covers tokenization, stop words, and stemming. – ramcdougal Apr 01 '14 at 22:09
  • @ramcdougal: That's a good start. What I'm missing is how to apply this to a large dataset (e.g. in a Pandas dataframe) or to use it in the context of something like [`scikit-learn`'s `CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), which seems to be able to take a preprocessor as an argument. – orome Apr 01 '14 at 22:16

2 Answers

3

It seems tricky to get things exactly the same between `nltk` and `tm` in the preprocessing steps, so I think the best approach is to use `rpy2` to run the preprocessing in R and pull the results into Python:

import rpy2.robjects as ro
preproc = [x[0] for x in ro.r('''
tweets = read.csv("tweets.csv", stringsAsFactors=FALSE)
library(tm)
library(SnowballC)
corpus = Corpus(VectorSource(tweets$Tweet))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))
corpus = tm_map(corpus, stemDocument)''')]
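
If you're on a newer version of tm (0.6 or later), the base R functions like tolower need to be wrapped in `content_transformer`, and it can be simpler to pull the processed text out on the R side. Something along these lines should work (a sketch of that variant, untested here):

import rpy2.robjects as ro

# Variant sketch for tm >= 0.6: wrap base R functions in content_transformer()
# and return the processed text itself from the R block.
preproc = list(ro.r('''
library(tm)
library(SnowballC)
tweets = read.csv("tweets.csv", stringsAsFactors=FALSE)
corpus = Corpus(VectorSource(tweets$Tweet))
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))
corpus = tm_map(corpus, stemDocument)
sapply(corpus, as.character)'''))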

Then you can load it into scikit-learn; the only thing you'll need to do to get things to match between the `CountVectorizer` and the `DocumentTermMatrix` is to remove terms of length less than 3:

from sklearn.feature_extraction.text import CountVectorizer
def mytokenizer(x):
    # Split on whitespace and keep only terms of at least 3 characters,
    # matching tm's DocumentTermMatrix default minimum word length
    return [y for y in x.split() if len(y) > 2]

# Full document-term matrix
cv = CountVectorizer(tokenizer=mytokenizer)
X = cv.fit_transform(preproc)
X
# <1181x3289 sparse matrix of type '<type 'numpy.int64'>'
#   with 8980 stored elements in Compressed Sparse Column format>

# Sparse terms removed
cv2 = CountVectorizer(tokenizer=mytokenizer, min_df=0.005)
X2 = cv2.fit_transform(preproc)
X2
# <1181x309 sparse matrix of type '<type 'numpy.int64'>'
#   with 4669 stored elements in Compressed Sparse Column format>

Let's verify this matches with R:

tweets = read.csv("tweets.csv", stringsAsFactors=FALSE)
library(tm)
library(SnowballC)
corpus = Corpus(VectorSource(tweets$Tweet))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))
corpus = tm_map(corpus, stemDocument)
dtm = DocumentTermMatrix(corpus)
dtm
# A document-term matrix (1181 documents, 3289 terms)
# 
# Non-/sparse entries: 8980/3875329
# Sparsity           : 100%
# Maximal term length: 115 
# Weighting          : term frequency (tf)

sparse = removeSparseTerms(dtm, 0.995)
sparse
# A document-term matrix (1181 documents, 309 terms)
# 
# Non-/sparse entries: 4669/360260
# Sparsity           : 99%
# Maximal term length: 20 
# Weighting          : term frequency (tf)

As you can see, the number of stored elements and the number of terms now match exactly between the two approaches.
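
If you also want to compare the vocabularies themselves (this goes beyond the check above), you can build the DocumentTermMatrix in the same rpy2 session and diff the term sets; a sketch, assuming the R corpus and the `cv2` vectorizer from above are still available:

# Pull the terms of the sparse DocumentTermMatrix out of the same R session
r_terms = set(ro.r('''
dtm = DocumentTermMatrix(corpus)
sparseDTM = removeSparseTerms(dtm, 0.995)
colnames(as.matrix(sparseDTM))'''))

# Compare against the CountVectorizer vocabulary
py_terms = set(cv2.get_feature_names())
print(sorted(r_terms - py_terms))   # terms that appear only in R
print(sorted(py_terms - r_terms))   # terms that appear only in Python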

  • Is there a way to pass this to `scikit-learn`'s `CountVectorizer` constructor? The docs make it seem like this should be possible, but I can't figure out how. – orome Apr 01 '14 at 23:30
  • @raxacoricofallapatorius updated to include `CountVectorizer`. It's great to see someone working through the 15.071x content in Python! – josliber Apr 02 '14 at 02:58
  • Thanks. I'm new to R (which I loathe), Python (which is great), and analytics, so it's tough going. I wish the course were taught in Python! – orome Apr 02 '14 at 03:09
  • The whole course team's pretty enthusiastic about R (which is why we selected R for the course), but glad you found packages that work for you! – josliber Apr 02 '14 at 04:51
  • This gives me 348 terms (for the whole class corpus, at the 0.995 threshold) rather than 309. Any idea why that would be? I get 3462 for `len(vectorizer.get_feature_names())`; I don't know how many the equivalent command in R is returning. – orome Apr 02 '14 at 18:11
  • 3462 would then be expected to match the number of terms in your `DocumentTermMatrix` before calling `removeSparseTerms`. Can you just output the sorted names from R with `sort(colnames(as.matrix(sparseDTM)))` and compare to the output of `sorted(vectorizer.get_feature_names())`? – josliber Apr 02 '14 at 20:33
  • I've [listed the differences](https://gist.github.com/orome/9942746) between the reduced lists of words in Python (produced using essentially `Rpreproc` as above) and in R (using the code from the lecture). – orome Apr 02 '14 at 20:45
  • Well, it looks like no 2-character words are being selected by R, so you'd want to remove those word stems in `Rpreproc`. It looks like you're losing stems that start with a number in Python, which you may be able to solve. Unfortunately, "generat" and "genius" look like differences in the stemmer, which will be hard to solve. – josliber Apr 02 '14 at 21:04
  • Using "generat" and "genius" should be OK, right: as long as their occurrences match. I'm not sure how to keep the terms that start with a number: they generate errors in Python, so I removed them. I suppose I could just prepend a legal character. As for 'los' and 'yes' turning into 'lo' and 'ye' the appearance of 'go' and 'io', I'm not sure what to do. The former may be all right (if the mapping is 1-1) but I don't know about the latter. – orome Apr 02 '14 at 21:14
  • @raxacoricofallapatorius I think I was going down the wrong path with using the `nltk` preprocessor due to differences between the `nltk` and `tm` stemmers. I've updated to a solution that runs the preprocessing in R and pulls it back into Python using the `rpy2` package. – josliber Apr 03 '14 at 02:26
  • That's a good strategy. For me, it's also important that I learn how to do this entirely in Python, both (a) in the most Pythonic way *and* (b) in a way that gets as close as possible to R. I think I can learn (a) from the docs (and from some of what was in your earlier answer); (b) I got from your earlier answer; but this takes (b) further by using R as a backend. Thanks! – orome Apr 03 '14 at 17:43
  • Follow up: This tokenizer doesn't quite work for the vandalism problem and is pretty far off for the trials problem. Any chance of a version that will replicate R for tokenizing those? – orome Apr 13 '14 at 12:12
1

`CountVectorizer` and `TfidfVectorizer` can be customized as described in the docs. In particular, you'll want to write a custom tokenizer, which is a function that takes a document and returns a list of terms. Using NLTK:

import re

import nltk.corpus
import nltk.stem

def smart_tokenizer(doc):
    # Lowercase, pull out word tokens (which drops punctuation), remove
    # English stop words, and stem each remaining term.
    doc = doc.lower()
    doc = re.findall(r'\w+', doc, re.UNICODE)
    return [nltk.stem.PorterStemmer().stem(term)
            for term in doc
            if term not in nltk.corpus.stopwords.words('english')]

Demo:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> doc = "@Apple ear pods are AMAZING! Best sound from in-ear headphones I've ever had!"
>>> v = CountVectorizer(tokenizer=smart_tokenizer)
>>> v.fit_transform([doc]).toarray()
array([[1, 1, 1, 2, 1, 1, 1, 1, 1]])
>>> from pprint import pprint
>>> pprint(v.vocabulary_)
{u'amaz': 0,
 u'appl': 1,
 u'best': 2,
 u'ear': 3,
 u'ever': 4,
 u'headphon': 5,
 u'pod': 6,
 u'sound': 7,
 u've': 8}

(The example I linked to actually uses a class to cache the lemmatizer, but a function works too.)
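
If you want to get closer to the R pipeline in josilber's answer, you can fold custom words and a minimum stem length into the same function; for instance (an illustrative variant, with "apple" standing in for whatever custom words you want to drop):

import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

def smart_tokenizer_custom(doc, extra_words=('apple',)):
    # Lowercase, keep word characters only (this also strips punctuation),
    # drop English stop words plus the custom words, stem, and finally drop
    # stems shorter than 3 characters (tm's DocumentTermMatrix default).
    stemmer = PorterStemmer()
    stop = set(stopwords.words('english')) | set(extra_words)
    terms = re.findall(r'\w+', doc.lower(), re.UNICODE)
    stems = [stemmer.stem(t) for t in terms if t not in stop]
    return [s for s in stems if len(s) > 2]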

  • This doesn't handle punctuation, or custom words (like "apple" in [josilber's answer](http://stackoverflow.com/a/22798822/656912)). – orome Apr 02 '14 at 13:26
  • @raxacoricofallapatorius It's just an example. The point is that you can write a Python function and plug it in; what that function does is entirely up to you. You can pretty much plug in josilber's function. – Fred Foo Apr 02 '14 at 13:38