
I need to count the number of words (word appearances) in some corpus using NLTK package.

Here is my corpus:

corpus = PlaintextCorpusReader(r'C:\DeCorpus', '.*')  # raw string avoids backslash-escape issues in the Windows path

Here is how I try to get the total number of words for each document:

cfd_appr = nltk.ConditionalFreqDist(
    (textname, num_appr)
    for textname in corpus.fileids()
    for num_appr in [len(w) for w in corpus.raw(fileids=textname).replace("\r", " ").replace("\n", " ").split()])

(I split the strings into words manually; somehow it works better than using corpus.words(), but the problem remains the same, so it's irrelevant.) Generally, this does the same (wrong) job:

cfd_appr = nltk.ConditionalFreqDist(
    (textname, num_appr)
    for textname in corpus.fileids()
    for num_appr in [len(w) for w in corpus.words(fileids=textname)])

This is what I get by typing cfd_appr.tabulate():

                        1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  
2022.12.06_Bild 2.txt   3  36 109  40  47  43  29  29  33  23  24  12   8   6   4   2   2   0   0   0   0   
2022.12.06_Bild 3.txt   2  42 129  59  57  46  46  35  22  24  17  21  13   5   6   6   2   2   2   0   0   
2022.12.06_Bild 4.txt   3  36 106  48  43  32  38  30  19  39  15  14  16   6   5   8   3   2   3   1   0   
2022.12.06_Bild 5.txt   1  55 162  83  68  72  46  24  34  38  27  16  12   8   8   5   9   3   1   5   1   
2022.12.06_Bild 6.txt   7  69 216  76 113  83  73  52  49  42  37  20  19   9   7   5   3   6   3   0   1   
2022.12.06_Bild 8.txt   0   2   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   

But these are the counts of words of each length, per file. What I need is just a single total word count per text:

2022.12.06_Bild 2.txt    451.0
2022.12.06_Bild 3.txt    538.0
2022.12.06_Bild 4.txt    471.0
2022.12.06_Bild 5.txt    679.0
2022.12.06_Bild 6.txt    890.0
2022.12.06_Bild 8.txt      3.0
dtype: float64

I.e. the sum of the counts across all word lengths, which is what summing the columns with DataFrame(cfd_appr).transpose().sum(axis=1) produces. (By the way, if there is some way to set a name for this column, that would also be a solution, but .rename({None: 'W. appear.'}, axis='columns') is not working, and that approach would generally not be clear enough.)
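(Editorial aside on the column-naming problem: Series.to_frame() accepts a column name, so the summed Series can be given a header directly. A minimal sketch with a toy dict of per-file word-length counts standing in for cfd_appr, which behaves like a dict of dicts:)

```python
import pandas as pd

# Toy stand-in for cfd_appr: {filename: {word_length: frequency}}.
toy_cfd = {'a.txt': {3: 1, 5: 1}, 'b.txt': {4: 1}}

# Summing the columns per row gives the total word count per file;
# to_frame('W. appear.') names the resulting single column.
totals = pd.DataFrame(toy_cfd).transpose().sum(axis=1).to_frame('W. appear.')
```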

So, what I need is:

                             1    
2022.12.06_Bild 2.txt    451.0
2022.12.06_Bild 3.txt    538.0
2022.12.06_Bild 4.txt    471.0
2022.12.06_Bild 5.txt    679.0
2022.12.06_Bild 6.txt    890.0
2022.12.06_Bild 8.txt      3.0

Would be grateful for help!

Gavrk
  • Good question!! Munging CFD or FD in NLTK into pandas should have been a function in NLTK =) – alvas Feb 19 '20 at 08:10
  • It'll be really nice if there's a pull request to NLTK where we can do `ConditionalFreqDist.to_pandas` and it returns a `pd.DataFrame`. – alvas Feb 19 '20 at 08:10

2 Answers


Let's first try to replicate your table with the infamous BookCorpus, with the directory structure:

/books_in_sentences
   books_large_p1.txt
   books_large_p2.txt

In Code:

from nltk.corpus import PlaintextCorpusReader
from nltk import ConditionalFreqDist
from nltk import word_tokenize

from collections import Counter

import pandas as pd

corpus = PlaintextCorpusReader('books_in_sentences/', '.*')

cfd_appr = ConditionalFreqDist(
    (textname, num_appr)
    for textname in corpus.fileids()
    for num_appr in [len(w) for w in 
                     word_tokenize(corpus.raw(fileids=textname))])

Then the pandas munging part:

# Idiom to convert a FreqDist / ConditionalFreqDist into pd.DataFrame.
df = pd.DataFrame([dict(Counter(freqdist)) 
                   for freqdist in cfd_appr.values()], 
                 index=cfd_appr.keys())
# Fill in the not-applicable with zeros.
df = df.fillna(0).astype(int)

# If necessary, sort the columns by word length.
df = df[sorted(df.columns)]

# Sum all columns per row -> pd.Series
counts_per_row = df.sum(axis=1)

Finally, to access the indexed Series, e.g.:

print('books_large_p1.txt', counts_per_row['books_large_p1.txt'])

Alternatively

I would encourage the above solution so that you can work with the DataFrame to manipulate the numbers further, but if all you really need is the column sums per row, then try the following.

If there's a need to avoid pandas and use the values in CFD directly, then you would have to make use of the ConditionalFreqDist.values() and iterate through it carefully.

If we do:

>>> list(cfd_appr.values())
[FreqDist({3: 6, 6: 5, 1: 5, 9: 4, 4: 4, 2: 3, 8: 2, 10: 2, 7: 1, 14: 1}),
 FreqDist({4: 10, 3: 9, 1: 5, 7: 4, 2: 4, 5: 3, 6: 3, 11: 1, 9: 1})]

We'll see a list of FreqDist, each corresponding to one of the keys (in this case, the filenames):

>>> list(cfd_appr.keys())
['books_large_p1.txt', 'books_large_p2.txt']

Since we know that FreqDist is a subclass of collections.Counter, if we sum the values of each Counter object, we get:

>>> [sum(fd.values()) for fd in cfd_appr.values()]
[33, 40]

Which outputs the same values as df.sum(axis=1) above.

So to put it together:

>>> dict(zip(cfd_appr.keys(), [sum(fd.values()) for fd in cfd_appr.values()]))
{'books_large_p1.txt': 33, 'books_large_p2.txt': 40}
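Because FreqDist subclasses collections.Counter, the same totals can be reproduced without NLTK at all; a minimal sketch using plain Counter objects as stand-ins for the two FreqDists above (with real FreqDist objects, `fd.N()` returns the same total as `sum(fd.values())`):

```python
from collections import Counter

# Toy stand-ins for cfd_appr's values: word-length -> frequency per file,
# copied from the FreqDist reprs shown above.
cfd = {
    'books_large_p1.txt': Counter({3: 6, 6: 5, 1: 5, 9: 4, 4: 4, 2: 3,
                                   8: 2, 10: 2, 7: 1, 14: 1}),
    'books_large_p2.txt': Counter({4: 10, 3: 9, 1: 5, 7: 4, 2: 4,
                                   5: 3, 6: 3, 11: 1, 9: 1}),
}

# Total word count per file; with FreqDist, fd.N() gives the same number.
totals = {fname: sum(fd.values()) for fname, fd in cfd.items()}
```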
alvas
  • Well, it's much more complex than I expected =D And I may be wrong, but the result is the same as using `.sum(axis=1)`... At least, it seems so after I tried out this solution, with one difference (`dtype: int64`). Is there really no way to sum the results within the definition of `cfd_appr`? Apparently, I didn't manage to formulate my question clearly enough, sorry... I thought the problem appeared just because of my misunderstanding of Python's syntax. – Gavrk Feb 19 '20 at 08:31
  • There's a way to sum the results from cfd_appr, but you won't be satisfied with the complexities =) Thus the proposed solution of casting to a DataFrame – alvas Feb 19 '20 at 08:34
  • For completeness, I've added the answer to access attributes directly from `cfd_appr` – alvas Feb 19 '20 at 08:43
  • This piece of code is to construct a DataFrame that will be joined with other DataFrames by condition (it's for a Kivy app), thus I have two concerns: First, each column should have a header (in this case, `W. appear.`), and I'm not able to give a header to the column in this kind of table (as described in my question). Second, I won't be able to join `dtype: int64` DataFrames with other DataFrames that are not `dtype: int64`. – Gavrk Feb 19 '20 at 08:43
  • Hint: If you don't want to enforce int64, you don't need the `.astype(int)` in the code. – alvas Feb 19 '20 at 08:46
  • Hint: If you don't want the DataFrame to be indexed, you can remove `index=cfd_appr.keys()`. Or https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html – alvas Feb 19 '20 at 08:49
  • Thanks a lot! Hope it's not off-topic. Two questions. 1: How do I put that dictionary into a DataFrame? I'm getting `ValueError: If using all scalar values, you must pass an index`. And second, `counts_per_row` is a DataFrame without a header. Is there a way to give a name to its single column? `.rename({None: 'W. appear.'}, axis='columns')` is not working (obviously, because there is no header at all), and `.rename_axis('W. appear.')` renames the axis itself, not the column. – Gavrk Feb 19 '20 at 09:25
  • 1
    I believe in you, I'm sure some googling will get you the answers =) – alvas Feb 19 '20 at 09:42

Well, here is what was actually needed:

First, get the numbers of words of different length (just as I did before):

cfd_appr = nltk.ConditionalFreqDist(
    (textname, num_appr)
    for textname in corpus.fileids()
    for num_appr in [len(w) for w in corpus.raw(fileids=textname).replace("\r", " ").replace("\n", " ").split()])

Then import pandas as pd and apply to_frame(1) to the dtype: float64 Series obtained by summing the columns:

pd.DataFrame(cfd_appr).transpose().sum(axis=1).to_frame(1)

That's it. However, if somebody knows how to sum them up in the definition of cfd_appr, that would be a more elegant solution.
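For the record, summing inside the definition can be emulated by counting filenames directly, one increment per word occurrence, instead of counting word lengths. A hedged nltk-free sketch, with a hypothetical raw_texts dict standing in for the corpus (with the real corpus, the equivalent would be nltk.FreqDist(textname for textname in corpus.fileids() for w in corpus.words(textname))):

```python
from collections import Counter

# Hypothetical stand-in for the corpus: {filename: raw text}.
raw_texts = {'a.txt': 'ein zwei drei', 'b.txt': 'vier fuenf'}

# One count per word occurrence, keyed by filename, so each file's
# entry ends up holding its total word count.
word_totals = Counter(
    fname
    for fname, raw in raw_texts.items()
    for w in raw.split())
```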

Gavrk