
I have a dataframe of frequency counts: the column labels are the features being counted and the row labels are the pages in which they are counted. I need the probability of each feature occurring across all pages. I've been trying, unsuccessfully, to iterate over the columns, divide each column's sum by the sum of all columns, and store the result in a dictionary keyed by column label.

My dataframe looks something like this:

    | Word1 | Word2 |
----|-------|-------|
pg1 |   0   |   1   |
pg2 |   3   |   2   |
pg3 |   9   |   0   |
pg4 |   1   |   6   |
pg5 |   2   |   3   |
pg6 |   0   |   2   |

And I want my output to be a dictionary with the words as the keys and the sum(column) / sum(table) as the values, like this:

{ Word1: .517 ,  Word2: .483 }

So far I've attempted the following:

dict = {}
for x in df.sum(axis = 0):
    dict[x] = x / sum(df.sum(axis = 0))
print(dict)

but the command never completes. I'm not sure whether I've done something wrong in my code or whether perhaps my laptop simply doesn't have the ability to deal with the size of my dataset.

Can anyone point me in the right direction?

Jon Clements
needaclue

1 Answer


You can take the sum of each column and divide it by the sum of every value in the DataFrame's underlying array, e.g.:

df.sum().div(df.values.sum()).to_dict()

That'll give you:

{'Word1': 0.5172413793103449, 'Word2': 0.4827586206896552}
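For comparison, here is a minimal sketch of a loop-based version, rebuilt from the sample table in the question. Two things in the question's loop stand out: iterating over `df.sum(axis = 0)` yields the summed values rather than the column labels, so the dictionary ends up keyed by the sums, and `sum(df.sum(axis = 0))` is recomputed on every pass. Whether that fully explains the command never completing on the real dataset is an assumption. The names `col_sums`, `total` and `probs` are illustrative only.

```python
import pandas as pd

# Sample data copied from the table in the question
df = pd.DataFrame(
    {"Word1": [0, 3, 9, 1, 2, 0], "Word2": [1, 2, 0, 6, 3, 2]},
    index=["pg1", "pg2", "pg3", "pg4", "pg5", "pg6"],
)

col_sums = df.sum()      # one sum per column, indexed by column label
total = col_sums.sum()   # grand total, computed once outside the loop

# .items() yields (label, value) pairs, so the keys are the column labels
probs = {label: s / total for label, s in col_sums.items()}
print(probs)
```

This should produce the same dictionary as the one-liner above. The one-liner is still preferable, since it avoids the Python-level loop entirely.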
Jon Clements
  • Why the `.values` in `df.values.sum()`? – AMC Jan 27 '20 at 01:06
  • @AMC `df.values` gives you the underlying numpy array(s) and its default `.sum()` behaviour is *all* elements not by row/columns... otherwise, with pandas, you have to do `.sum().sum()` – Jon Clements Jan 27 '20 at 02:30
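To illustrate the point made in the comments, here is a small sketch using the sample data from the question (the variable names are illustrative). It shows that NumPy's `.sum()` with no arguments collapses the whole array to one number, while pandas' `.sum()` works column by column and needs a second `.sum()` to reach the grand total.

```python
import pandas as pd

# Sample data copied from the table in the question
df = pd.DataFrame({"Word1": [0, 3, 9, 1, 2, 0], "Word2": [1, 2, 0, 6, 3, 2]})

total_numpy = df.values.sum()        # NumPy default: sum over every element
total_pandas = df.sum().sum()        # pandas: per-column sums, then sum those
per_column = df.values.sum(axis=0)   # NumPy with axis=0: one sum per column

print(total_numpy, total_pandas, per_column)
```

Both totals should agree, so the choice between them is mostly a matter of taste.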