Converting nested dictionary to dataframe with the keys as rownames and the dictionaries in the values as columns?

Question

I have a dataframe of frequency counts: the column labels are the features being counted and the row labels are the pages in which they are counted. I need the probability of each feature occurring across all pages, so I'm trying (so far unsuccessfully) to iterate through the columns, divide each column's sum by the sum of all columns, and store the result in a dictionary keyed by the column label.

My dataframe looks something like this:

    | Word1 | Word2 |
----|-------|-------|
pg1 |   0   |   1   |
pg2 |   3   |   2   |
pg3 |   9   |   0   |
pg4 |   1   |   6   |
pg5 |   2   |   3   |
pg6 |   0   |   2   |

And I want my output to be a dictionary with the words as the keys and the sum(column) / sum(table) as the values, like this:

{'Word1': 0.517, 'Word2': 0.483}

So far I've attempted the following:

dict = {}
for x in df.sum(axis = 0):
    dict[x] = x / sum(df.sum(axis = 0))
print(dict)

but the command never completes. I'm not sure whether I've done something wrong in my code or whether perhaps my laptop simply doesn't have the ability to deal with the size of my dataset.

Can anyone point me in the right direction?

Could you show some sample input and expected output? It's not exactly clear what you're trying to achieve here. If you have a look at https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples and [edit] your question accordingly, that'll help people help you. – Jon Clements, Jan 26 '20 at 18:44
@needaclue I think so... made an answer anyway... should work given what you've shown (hopefully!) – Jon Clements, Jan 26 '20 at 20:28

score 1 · Accepted Answer · answered Jan 26 '20 at 20:21

It looks like you can take the sum of each column and then divide by the sum of all values in the DataFrame's underlying array, e.g.:

df.sum().div(df.values.sum()).to_dict()

That'll give you:

{'Word1': 0.5172413793103449, 'Word2': 0.4827586206896552}
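For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample table from the question. The way `df` is constructed and the names `col_totals`, `grand_total`, `probs` and `probs_loop` are assumptions made for illustration; only the `df.sum().div(df.values.sum()).to_dict()` chain comes from the answer itself. The loop at the end is one possible way to write the question's attempt so that it keys on column labels and computes the grand total only once.

```python
import pandas as pd

# Sample data copied from the table in the question (construction is illustrative).
df = pd.DataFrame(
    {"Word1": [0, 3, 9, 1, 2, 0], "Word2": [1, 2, 0, 6, 3, 2]},
    index=["pg1", "pg2", "pg3", "pg4", "pg5", "pg6"],
)

# Per-column totals (a Series indexed by column label).
col_totals = df.sum()

# Grand total over every cell, computed once.
grand_total = df.values.sum()

# Vectorised version from the answer.
probs = col_totals.div(grand_total).to_dict()

# Loop equivalent: iterate over (label, total) pairs, not over the totals alone.
probs_loop = {}
for label, total in col_totals.items():
    probs_loop[label] = total / grand_total

print(probs)
print(probs_loop)
```

Both dictionaries hold 15/29 for `Word1` and 14/29 for `Word2`, which round to the .517 and .483 asked for in the question.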

Jon Clements

138,671
33
247
280

Why the `.values` in `df.values.sum()`? – AMC Jan 27 '20 at 01:06
@AMC `df.values` gives you the underlying numpy array(s) and its default `.sum()` behaviour is *all* elements not by row/columns... otherwise, with pandas, you have to do `.sum().sum()` – Jon Clements Jan 27 '20 at 02:30
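The difference described in that comment can be checked with a short sketch. The small frame below reuses the question's sample data, and the variable names are assumptions for illustration; `DataFrame.to_numpy()` is shown as an alternative spelling of `.values` on pandas 0.24 or later.

```python
import pandas as pd

# Same sample data as in the question (construction is illustrative).
df = pd.DataFrame(
    {"Word1": [0, 3, 9, 1, 2, 0], "Word2": [1, 2, 0, 6, 3, 2]},
    index=["pg1", "pg2", "pg3", "pg4", "pg5", "pg6"],
)

# NumPy's sum with no axis argument adds up every element of the 2-D array.
total_via_values = df.values.sum()
total_via_to_numpy = df.to_numpy().sum()

# pandas' DataFrame.sum() reduces along rows by default, giving one value
# per column, so a second .sum() is needed to reach a single scalar.
per_column = df.sum()
total_via_pandas = df.sum().sum()

print(per_column.to_dict())
print(total_via_values, total_via_to_numpy, total_via_pandas)
```

All three totals agree; the choice between them is mostly a matter of taste, though `.values` and `.to_numpy()` skip the intermediate per-column Series.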