
I have a directory containing corpus text files, and I want to create a table of word counts: each row corresponds to a document and each column to a unique word, with each cell holding the number of times that word occurs in that document. All of this should be done in Python. Please help, thank you.

The table should look like this:

          word1   word2   word3  ...
doc1      14      5       45
doc2      6       1       0
 .
 .
 .

 

import nltk
import collections
import os.path

def cleanDoc(doc):
    stopset = set(nltk.corpus.stopwords.words('english'))
    stemmer = nltk.PorterStemmer()
    tokens = nltk.WordPunctTokenizer().tokenize(doc)
    clean = [token.lower() for token in tokens if token.lower() not in stopset and len(token) > 2]
    final = [stemmer.stem(word) for word in clean]
    return final

path = "c://Users/Desktop/corpus files"

i=0

for file in os.listdir(path) :

    f = open("c://Users/Desktop/corpus files/file%d.txt" %i,'r')
    data= f.read()
    words = cleanDoc(data)
    fw = open("c://Users/Desktop/words/words%d.txt" %i,'w')
    fd = collections.Counter(words)
    #fd = nltk.FreqDist(words)
    #plot(fd)

    row_format = "{:>15}" * (len(words) + 1)
    print row_format.format("document %d" %i, *words)
    #for

    fw.write(str(fd))
    fw.write(str(words))
    fw.close()
    i=i+1
    f.close()
DummyGuy
  • I'm confused about the names of the corpus text files. You have a `for` loop that will iterate over every file in the path, but then ignore those and attempt to read `file%d.txt" %i`. What are the names or what is the pattern of the names of the corpus files? – martineau Dec 01 '13 at 14:19
  • I have separated all the body parts of the corpus into separate text files, and I need to count the unique words of each document. I saved the word counts in separate text files just to check the words; I know I need to remove that. – DummyGuy Dec 01 '13 at 15:58
  • So the corpus files have just been given names like `"file1.txt", "file2.txt", ...` correct? – martineau Dec 01 '13 at 16:10
  • If the code below doesn't suit your purposes, just say how you'd like it tailored... – duhaime Dec 01 '13 at 16:23
  • Yes, the corpus files are given names like `file1.txt`, etc. – DummyGuy Dec 01 '13 at 17:52
  • I have shown the table format in the question above. – DummyGuy Dec 01 '13 at 18:13
  • Ah, the thing you want is called a "term-document matrix". This is simple to produce in R, if that's an option for you. If you're dedicated to NLTK, though, you might want to see: 1) http://www.puffinwarellc.com/index.php/news-and-articles/articles/33-latent-semantic-analysis-tutorial.html?start=2 or 2) http://stackoverflow.com/questions/15899861/efficient-term-document-matrix-with-nltk – duhaime Dec 01 '13 at 20:39
  • Sorry, but I am new to Python and I am unable to implement the count matrix. Your link is very useful and is exactly what I need; will you please tell me how to do it? – DummyGuy Dec 02 '13 at 17:00

1 Answer


I think this is fairly close to, if not exactly, what you want. In case it isn't, I tried to make things easy to change.

To produce the desired table, processing is done in two phases. In the first, the unique words in each document file of the form `file<document-number>.txt` are found and saved in a corresponding `words<document-number>.txt` file, and they are also added to a set comprising all the unique words seen among all the document files. This set is needed to produce table columns that consist of all the unique words in all the files, and it is why two phases of processing are required.

In the second phase, the word files are read back in and turned back into dictionaries, which are used to fill in the corresponding columns of the table being printed.
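
To make the hand-off between the two phases concrete, here is a minimal sketch (not part of the answer's code; the file name `words0.txt` and the sample words are only illustrative) of how a word-count dictionary is written out as its `repr()` and later rebuilt with `ast.literal_eval()`:

import ast
import collections

# phase 1 side: count some words and write the dict's repr() to a text file
counts = collections.Counter(['cat', 'dog', 'cat'])
with open('words0.txt', 'wt') as fw:
    fw.write(repr(dict(counts)) + '\n')   # e.g. "{'cat': 2, 'dog': 1}"

# phase 2 side: read the text back and safely rebuild the dictionary
with open('words0.txt') as f:
    restored = ast.literal_eval(f.read())
assert restored == dict(counts)

The full two-phase implementation follows.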

import ast
import collections
import nltk
import re
import os

user_name = "UserName"
path = "c://Users/%s/Desktop/corpus files" % user_name

def cleanDoc(doc):
    stopset = set(nltk.corpus.stopwords.words('english'))
    stemmer = nltk.PorterStemmer()
    tokens = nltk.WordPunctTokenizer().tokenize(doc)
    clean = [token.lower() for token in tokens
                           if token.lower() not in stopset and len(token) > 2]
    final = [stemmer.stem(word) for word in clean]
    return final

# phase 1 -- find unique words, create word files, update overall unique word set

corpus_file_pattern = re.compile(r"""file(\d+).txt""")
unique_words = set()
longest_filename = 0
document_nums = []

for filename in os.listdir(path):
    corpus_file_match = corpus_file_pattern.match(filename)
    if corpus_file_match:  # corpus text file?
        if len(filename) > longest_filename:
            longest_filename = len(filename)
        document_num = int(corpus_file_match.group(1))
        document_nums.append(document_num)
        with open(os.path.join(path, filename)) as file:
            data = file.read()
        words = cleanDoc(data)
        unique_words.update(words)
        fd = collections.Counter(words)
        words_filename = "words%d.txt" % document_num
        with open(os.path.join(path, words_filename), mode = 'wt') as fw:
            fw.write(repr(dict(fd)) + '\n')  # write representation as dict

# phase 2 -- create table using unique_words and data in word files

unique_words_list = sorted(unique_words)
unique_words_empty_counter = collections.Counter({word: 0 for word
                                                            in unique_words})
document_nums = sorted(document_nums)
padding = 2  # spaces between columns
min_col_width = 5
col_headings = ["Document"] + unique_words_list
col_widths = [max(min_col_width, len(word))+padding for word in col_headings]
col_widths[0] = longest_filename+padding  # first col is special case

# print table headings
for i, word in enumerate(col_headings):
    print "{:{align}{width}}".format(word, align='>' if i else '<',
                                     width=col_widths[i]),
print

for document_num in document_nums:
    # read word in document dictionary back in
    filename = "words%d.txt" % document_num
    file_words = unique_words_empty_counter.copy()
    with open(os.path.join(path, filename)) as file:
        data = file.read()
    # convert data read into dict and update with file word counts
    file_words.update(ast.literal_eval(data))
    # print row of data
    print "{:<{width}}".format(filename, width=col_widths[0]),
    for i, word in enumerate(col_headings[1:], 1):
        print "{:>{width}n}".format(file_words[word], width=col_widths[i]),
    print
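
A note on running this: the code above uses Python 2 print statements (with trailing commas to suppress newlines). Under Python 3 those lines would need the print() function instead; here is a minimal sketch of just the heading loop, assuming everything else stays the same:

for i, word in enumerate(col_headings):
    # end=' ' replaces the trailing comma of the Python 2 print statement
    print("{:{align}{width}}".format(word, align='>' if i else '<',
                                     width=col_widths[i]), end=' ')
print()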
martineau
  • Yes, I tried your answer, but it's not showing me the table in any format I can understand, and it is also giving me results for commas... – DummyGuy Dec 07 '13 at 20:38
  • If there's commas in the table, it's because your `cleanDoc()` function -- which I did not change -- is putting them as word entries in the list it returns. As for not understanding the table format, I was only following what I believe you described in your question. – martineau Dec 07 '13 at 22:48
  • You're welcome. Glad to hear it's at least basically working for you. I spent a fair amount of time trying to get it right. – martineau Jan 28 '14 at 01:59
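
As the comments above note, the commas in the output come from cleanDoc() keeping punctuation tokens. A minimal tweak (a suggestion, not part of the original answer) is to keep only alphabetic tokens; this is intended as a drop-in replacement for the cleanDoc() used above:

import nltk

def cleanDoc(doc):
    stopset = set(nltk.corpus.stopwords.words('english'))
    stemmer = nltk.PorterStemmer()
    tokens = nltk.WordPunctTokenizer().tokenize(doc)
    # token.isalpha() drops punctuation-only tokens such as "," and "..."
    clean = [token.lower() for token in tokens
                           if token.isalpha()
                           and token.lower() not in stopset
                           and len(token) > 2]
    final = [stemmer.stem(word) for word in clean]
    return final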