I'm looking for an effective way to construct a Term Document Matrix in Python that can be used together with extra data.

I have some text data with a few other attributes. I would like to run some analyses on the text and I would like to be able to correlate features extracted from text (such as individual word tokens or LDA topics) with the other attributes.

My plan was to load the data into a pandas data frame, with each response representing a document. Unfortunately, I ran into an issue:

import pandas as pd
import nltk

pd.options.display.max_colwidth = 10000

txt_data = pd.read_csv("data_file.csv",sep="|")
# NB: str() on a Series renders it for display, so content can be elided
txt = str(txt_data.comment)
len(txt)
Out[7]: 71581 

txt = nltk.word_tokenize(txt)
txt = nltk.Text(txt)
txt.count("the")
Out[10]: 45

# for comparison, read the same comments directly from a plain-text file
txt_lines = []
f = open("txt_lines_only.txt")
for line in f:
    txt_lines.append(line)

txt = str(txt_lines)
len(txt)
Out[14]: 1668813

txt = nltk.word_tokenize(txt)
txt = nltk.Text(txt)
txt.count("the")
Out[17]: 10086

Note that in both cases the text was preprocessed so that everything except spaces, letters, and the characters ,.?! was removed (for simplicity).

As you can see, the pandas column converted into a string returns fewer matches, and the string itself is shorter; presumably str() returns the Series' truncated display representation rather than the full raw text.

Is there any way to improve the above code?

Also, str(x) creates one big string out of the comments, while [str(x) for x in txt_data.comment] creates a list object that cannot be broken into a bag of words. What is the best way to produce an nltk.Text object that retains document indices? In other words, I'm looking for a way to create a Term Document Matrix, the equivalent of R's TermDocumentMatrix() from the tm package.
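To make the target concrete, here is the kind of structure I'm after, mocked up by hand with hypothetical toy counts (terms as rows, document indices as columns):

import pandas as pd

# hand-built mock-up with made-up counts, just to show the target shape
tdm = pd.DataFrame({0: {'the': 3, 'cat': 1},
                    1: {'the': 1, 'dog': 2}}).fillna(0).astype(int)
print(tdm)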

Many thanks.

  • Not sure what your question is, but there are other NLP libraries that might be of help, such as pattern, textblob, and C&C. If you've reached a dead end, you can try those too; each has its own advantages over the others. – mid Jan 14 '16 at 08:01
  • Thanks @mid, I'm aware of gensim, but I'd never heard of textblob before; it does indeed look useful! I'm quite new to Python (I usually work in R) and I really doubt I've reached a dead end with NLTK. Considering how popular the package is, I'm certain I'm just missing something. – IVR Jan 16 '16 at 02:57

1 Answer

The benefit of using a pandas DataFrame is that you can apply the nltk functionality to each row, like so:

import numpy as np
import pandas as pd
import nltk
from nltk import word_tokenize

# build a toy corpus: 50 'documents' of 1,000 random dictionary words each
word_file = "/usr/share/dict/words"
words = open(word_file).read().splitlines()[10:50]
random_word_list = [[' '.join(np.random.choice(words, size=1000, replace=True))] for i in range(50)]

df = pd.DataFrame(random_word_list, columns=['text'])
df.head()

                                                text
0  Aaru Aaronic abandonable abandonedly abaction ...
1  abampere abampere abacus aback abalone abactor...
2  abaisance abalienate abandonedly abaff abacina...
3  Ababdeh abalone abac abaiser abandonable abact...
4  abandonable abandon aba abaiser abaft Abama ab...

len(df)

50

txt = df.text.apply(word_tokenize)
txt.head()

0    [Aaru, Aaronic, abandonable, abandonedly, abac...
1    [abampere, abampere, abacus, aback, abalone, a...
2    [abaisance, abalienate, abandonedly, abaff, ab...
3    [Ababdeh, abalone, abac, abaiser, abandonable,...
4    [abandonable, abandon, aba, abaiser, abaft, Ab...

txt.apply(len)

0     1000
1     1000
2     1000
3     1000
4     1000
....
44    1000
45    1000
46    1000
47    1000
48    1000
49    1000
Name: text, dtype: int64

As a result, you get the .count() for each row entry:

txt = txt.apply(lambda x: nltk.Text(x).count('abac'))
txt.head()

0    27
1    24
2    17
3    25
4    32

You can then sum the result using:

txt.sum()

1239
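If you need an actual term-document matrix that retains the document indices (the equivalent of R's tm::TermDocumentMatrix asked about above), one option is scikit-learn's CountVectorizer; a minimal sketch, assuming a recent scikit-learn is installed:

from sklearn.feature_extraction.text import CountVectorizer

# one row per document, one column per term (sparse counts)
vec = CountVectorizer()
dtm = vec.fit_transform(df.text)

# wrap in a DataFrame that shares df's index, then transpose for a
# term-document orientation
dtm_df = pd.DataFrame(dtm.toarray(), index=df.index,
                      columns=vec.get_feature_names_out())
tdm_df = dtm_df.T

Since dtm_df shares df's index, individual term columns can be joined to or correlated with the other attributes directly.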
  • Thanks @Stefan, that just about resolves my problem, although the `txt` object is still a pandas object, which means I can only use some of the NLTK functions via `apply`, `map`, or `for` loops. However, if I want to do something like `nltk.Text(txt).concordance("the")` I will run into problems. To resolve this I would still need to convert the entire text variable into a string, and as we saw in my first example, that string gets truncated for some reason. Any thoughts on how to overcome this? Many thanks! – IVR Jan 16 '16 at 03:02
  • 1
    You can convert the entire `text` `column` into one list of words using: `[t for t in df.text.tolist()]` - either after creation or after `.tokenize()`. – Stefan Jan 18 '16 at 14:48
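Building on that comment, a minimal sketch for corpus-level operations such as .concordance(), reusing the df built above (joining the raw documents avoids the truncated str(Series) representation):

# join all documents into one raw string, then tokenize once
full_text = ' '.join(df.text.tolist())
corpus = nltk.Text(word_tokenize(full_text))
corpus.concordance('abac')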