I'm looking for an effective way to construct a Term Document Matrix in Python that can be used together with extra data.
I have some text data with a few other attributes. I would like to run some analyses on the text and I would like to be able to correlate features extracted from text (such as individual word tokens or LDA topics) with the other attributes.
My plan was to load the data as a pandas data frame, with each response representing one document. Unfortunately, I ran into an issue:
import pandas as pd
import nltk

pd.options.display.max_colwidth = 10000

# load the pipe-delimited file; each row's "comment" field is one document
txt_data = pd.read_csv("data_file.csv", sep="|")

# str() on a Series returns its display representation, not the raw text
txt = str(txt_data.comment)
len(txt)
Out[7]: 71581

txt = nltk.word_tokenize(txt)
txt = nltk.Text(txt)
txt.count("the")
Out[10]: 45
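My suspicion is that str() on a Series only serializes pandas' truncated display preview (row indices, a Name/dtype footer, and at most display.max_rows rows), which would explain the shorter string. A minimal sketch of the concatenation I assume is intended instead (the comment column name is from my file):

# join the raw strings rather than the Series' display repr
all_text = " ".join(txt_data.comment.astype(str))
tokens = nltk.word_tokenize(all_text)
nltk.Text(tokens).count("the")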
txt_lines = []
f = open("txt_lines_only.txt")
for line in f:
    txt_lines.append(line)
f.close()

# str() on a list gives the list's repr, complete with brackets, quotes
# and escaped newlines, which inflates the length
txt = str(txt_lines)
len(txt)
Out[14]: 1668813

txt = nltk.word_tokenize(txt)
txt = nltk.Text(txt)
txt.count("the")
Out[17]: 10086
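The file-based version has the mirror-image problem: str(txt_lines) serializes the whole list object, so the length is padded with quotes, commas and escaped newlines. Presumably the lines should be joined instead, something like:

# join the raw lines instead of stringifying the list object
file_text = " ".join(line.strip() for line in txt_lines)
tokens = nltk.word_tokenize(file_text)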
Note that in both cases the text was preprocessed so that anything other than spaces, letters and ,.?! was removed (for simplicity).
As you can see, the pandas field converted into a string returns fewer matches, and the string itself is also shorter.
Is there any way to improve the above code?
Also, str(x) creates one big string out of the comments, while [str(x) for x in txt_data.comment] creates a list object which cannot be broken into a bag of words. What is the best way to produce an nltk.Text object that retains document indices? In other words, I'm looking for a way to create a Term Document Matrix, the equivalent of R's TermDocumentMatrix() from the tm package.
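For concreteness, here is a minimal sketch of the kind of output I am after, assuming scikit-learn's CountVectorizer is an acceptable tool (I am open to alternatives). Rows stay indexed by document, so the matrix can be joined back to the other attributes:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# assumed setup: one document per row of the original data frame
docs = txt_data.comment.astype(str).tolist()

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)  # sparse document-term counts

# wrap in a DataFrame so document indices line up with the other attributes
dtm_df = pd.DataFrame(dtm.toarray(),
                      index=txt_data.index,
                      columns=vectorizer.get_feature_names_out())

Strictly speaking this is a document-term matrix, i.e. the transpose of tm's TermDocumentMatrix(), which seems more natural for correlating rows with per-document attributes.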
Many thanks.