
I am doing a data cleaning task on a text file full of sentences. After stemming these sentences I would like to get the frequency of the words in my stemmed list. However, I am running into a problem: when printing the stemmed list, stem_list, I obtain a separate list for every sentence, like so:

[u'anyon', u'think', u'forgotten', u'day', u'parti', u'friend', u'friend', u'paymast', u'us', u'longer', u'memori']

[u'valu', u'friend', u'bought', u'properti', u'actual', u'relev', u'repres', u'actual', u'valu', u'properti']

[u'monster', u'wreck', u'reef', u'cargo', u'vessel', u'week', u'passeng', u'ship', u'least', u'24', u'hour', u'upload', u'com']

I would like to obtain the frequency of all of the words, but I am only obtaining the frequency per sentence with the following code:

fdist = nltk.FreqDist(stem_list)
for word, frequency in fdist.most_common(50):
    print(u'{};{}'.format(word, frequency))

This is producing the following output: friend;2 paymast;1 longer;1 memori;1 parti;1 us;1 day;1 anyon;1 forgotten;1 think;1 actual;2 properti;2 valu;2 friend;1 repres;1 relev;1 bought;1 week;1 cargo;1 monster;1 hour;1 wreck;1 upload;1 passeng;1 least;1 reef;1 24;1 vessel;1 ship;1 com;1 within;1 area;1 territori;1 custom;1 water;1 3;1

The word 'friend' is being counted separately (friend;2 and friend;1) because it appears in two different sentences. How can I make it count 'friend' as a single entry and display friend;3 in this case?

Andre Croucher

3 Answers


You could just concatenate everything into one list:

stem_list = [inner for outer in stem_list for inner in outer]

and process it the same way you already do.
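For example, a minimal sketch, assuming stem_list is a list of token lists as shown in the question (flat_words is just an illustrative name):

import nltk

flat_words = [inner for outer in stem_list for inner in outer]  # one flat list of stems
fdist = nltk.FreqDist(flat_words)                               # counts across all sentences
for word, frequency in fdist.most_common(50):
    print(u'{};{}'.format(word, frequency))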

Otherwise, you could keep the same per-sentence code, but instead of printing each result you create a dict and populate it with the values you get: the first time you see a word you create its key, and after that you add to its value.

all_words_count = dict()
for word, frequency in fdist.most_common(50):
    if word in all_words_count:  # already found
        all_words_count[word] += frequency
    else:  # not found yet
        all_words_count[word] = frequency

for word in all_words_count:
    print(u'{};{}'.format(word, all_words_count[word]))
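A variation on the same idea, shown only as a sketch: collections.Counter handles the found / not-found branching for you. This assumes stem_list is a list of token lists and that one FreqDist is built per sentence, as in your current loop:

from collections import Counter

import nltk

all_words_count = Counter()
for sentence in stem_list:                           # one token list per sentence (assumed)
    all_words_count.update(nltk.FreqDist(sentence))  # adds this sentence's counts

for word, frequency in all_words_count.most_common(50):
    print(u'{};{}'.format(word, frequency))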
iFlo
  • I have tried doing that but it ends up printing out each letter separately, like so: [u'a', u'n', u'y', u'o', u'n', u't', u'h', u'i', u'n', u'k', u'f', u'o', u'r', u'g', u'o', u't', u't', u'e', u'n', u'd', u'a', u'y', u'p', u'a', u'r', u't', u'i', u'f', u'r', u'i', u'e', u'n', u'd', u'f', u'r', u'i', u'e', u'n', u'd', u'p', u'a', u'y', u'm', u'a', u's', u't', u'u', u's', u'l', u'o', u'n', u'g', u'e', u'r', u'm', u'e', u'm', u'o', u'r', u'i'] – Andre Croucher Dec 23 '16 at 09:28
  • What exactly is stem_list? Is it a list of lists? The structure in your post is not clear. – iFlo Dec 23 '16 at 09:31
  • Sorry for that, they are word-vectors since I had tokenized my text file (containing the sentences) before. – Andre Croucher Dec 23 '16 at 09:43

I think the easiest way is to combine the lists before passing them to the function.

allwords = [inner for outer in stem_list for inner in outer]

fdist = nltk.FreqDist(allwords)
for word, frequency in fdist.most_common(50):
    print(u'{};{}'.format(word, frequency))

or shorter:

fdist = nltk.FreqDist([inner for outer in stem_list for inner in outer])
for word, frequency in fdist.most_common(50):
    print(u'{};{}'.format(word, frequency))

I think your input looks like:

stem_list = [[u'anyon', u'think', u'forgotten', u'day', u'parti', u'friend', u'friend', u'paymast', u'us', u'longer', u'memori'],

            [u'valu', u'friend', u'bought', u'properti', u'actual', u'relev', u'repres', u'actual', u'valu', u'properti'],

            [u'monster', u'wreck', u'reef', u'cargo', u'vessel', u'week', u'passeng', u'ship', u'least', u'24', u'hour', u'upload', u'com'],

            [.....], etc for the other sentences ]

So you have two levels of lists: the outer list holds the sentences and each inner list holds the words of one sentence. With allwords = [inner for outer in stem_list for inner in outer] you run through the sentences and combine them into one flat list of words, as illustrated below.
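A small illustration with two made-up sentences built from the words in the question:

stem_list = [[u'valu', u'friend'], [u'friend', u'properti']]
allwords = [inner for outer in stem_list for inner in outer]
# allwords is now one flat list: [u'valu', u'friend', u'friend', u'properti']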

Michael Weber
  • `allwords = [sent for sent in stem_list]` will not do anything. It will take the inner lists and put them in a list. The `stem_list` remains the same, except that it is also referenced by `allwords`. – iFlo Dec 23 '16 at 09:34
  • Thanks, have corrected it with the iterator from iFlo - haven't checked it. – Michael Weber Dec 23 '16 at 09:42

You could flatten your 2D list first with itertools.chain.from_iterable:

from itertools import chain

fdist = nltk.FreqDist(chain.from_iterable(stem_list))
for word, frequency in fdist.most_common(50):
    print(u'{};{}'.format(word, frequency))
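If you would rather not flatten at all: in recent NLTK versions FreqDist is a Counter subclass, so you can also build the distribution one sentence at a time. A sketch, assuming stem_list is a list of token lists:

import nltk

fdist = nltk.FreqDist()
for sentence in stem_list:   # one token list per sentence (assumed)
    fdist.update(sentence)   # add this sentence's word counts

for word, frequency in fdist.most_common(50):
    print(u'{};{}'.format(word, frequency))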
trincot