I'm looking for an effective way to construct a Term Document Matrix in Python that can be used together with extra data.
I have some text data with a few other attributes. I would like to run some analyses on the text and I would like to be able to correlate features extracted from text (such as individual word tokens or LDA topics) with the other attributes.
My plan was to load the data as a pandas data frame, with each response representing one document. Unfortunately, I ran into an issue:
import pandas as pd
import nltk

pd.options.display.max_colwidth = 10000

# load the pipe-delimited file; each row's "comment" field is one document
txt_data = pd.read_csv("data_file.csv", sep="|")

# str() on a Series returns its display representation, not the raw text
txt = str(txt_data.comment)
len(txt)
Out[7]: 71581

txt = nltk.word_tokenize(txt)
txt = nltk.Text(txt)
txt.count("the")
Out[10]: 45
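My suspicion is that str() on a Series only serializes pandas' truncated display preview (row indices, a Name/dtype footer, and at most display.max_rows rows), which would explain the shorter string. A minimal sketch of the concatenation I assume is intended instead (the comment column name is from my file):

# join the raw strings rather than the Series' display repr
all_text = " ".join(txt_data.comment.astype(str))
tokens = nltk.word_tokenize(all_text)
nltk.Text(tokens).count("the")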
txt_lines = []
f = open("txt_lines_only.txt")
for line in f:
    txt_lines.append(line)
f.close()

# str() on a list gives the list's repr, complete with brackets, quotes
# and escaped newlines, which inflates the length
txt = str(txt_lines)
len(txt)
Out[14]: 1668813

txt = nltk.word_tokenize(txt)
txt = nltk.Text(txt)
txt.count("the")
Out[17]: 10086
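The file-based version has the mirror-image problem: str(txt_lines) serializes the whole list object, so the length is padded with quotes, commas and escaped newlines. Presumably the lines should be joined instead, something like:

# join the raw lines instead of stringifying the list object
file_text = " ".join(line.strip() for line in txt_lines)
tokens = nltk.word_tokenize(file_text)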
Note that in both cases the text was preprocessed so that anything other than spaces, letters and ,.?! was removed (for simplicity).
As you can see, the pandas field converted into a string returns fewer matches, and the string itself is also shorter.
Is there any way to improve the above code?
Also, str(x) creates one big string out of the comments, while [str(x) for x in txt_data.comment] creates a list object which cannot be broken into a bag of words. What is the best way to produce an nltk.Text object that retains document indices? In other words, I'm looking for a way to create a Term Document Matrix, the equivalent of R's TermDocumentMatrix() from the tm package.
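For concreteness, here is a minimal sketch of the kind of output I am after, assuming scikit-learn's CountVectorizer is an acceptable tool (I am open to alternatives). Rows stay indexed by document, so the matrix can be joined back to the other attributes:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# assumed setup: one document per row of the original data frame
docs = txt_data.comment.astype(str).tolist()

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)  # sparse document-term counts

# wrap in a DataFrame so document indices line up with the other attributes
dtm_df = pd.DataFrame(dtm.toarray(),
                      index=txt_data.index,
                      columns=vectorizer.get_feature_names_out())

Strictly speaking this is a document-term matrix, i.e. the transpose of tm's TermDocumentMatrix(), which seems more natural for correlating rows with per-document attributes.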
Many thanks.