I have a large dataframe of frequency counts, where the column labels are the features being counted and the row labels are the pages on which they are counted. I need to find the probability of each feature occurring across all pages. To do that, I'm trying (so far unsuccessfully) to iterate through the columns, divide each column's sum by the sum of the whole table, and save each result in a dictionary keyed by the column label.
My dataframe looks something like this:
|     | Word1 | Word2 |
|-----|-------|-------|
| pg1 |     0 |     1 |
| pg2 |     3 |     2 |
| pg3 |     9 |     0 |
| pg4 |     1 |     6 |
| pg5 |     2 |     3 |
| pg6 |     0 |     2 |
I want my output to be a dictionary with the words as the keys and sum(column) / sum(table) as the values, like this:
{'Word1': 0.517, 'Word2': 0.483}
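To make the example reproducible, the sample table can be built as below. This is only a sketch of the toy data shown above, not my real dataset, and the names `column_totals` and `grand_total` are just illustrative. The expected values come from 15 / 29 and 14 / 29.

```python
import pandas as pd

# Toy data copied from the sample table; the real frame is much larger.
df = pd.DataFrame(
    {"Word1": [0, 3, 9, 1, 2, 0], "Word2": [1, 2, 0, 6, 3, 2]},
    index=["pg1", "pg2", "pg3", "pg4", "pg5", "pg6"],
)

column_totals = df.sum(axis=0)           # one total per word: 15 and 14
grand_total = int(column_totals.sum())   # sum of the whole table: 29

print(column_totals.to_dict(), grand_total)
```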
So far I've attempted the following:
dict = {}
for x in df.sum(axis = 0):
    dict[x] = x / sum(df.sum(axis = 0))
print(dict)
but the command never completes. I'm not sure whether I've done something wrong in my code or whether my laptop simply can't handle the size of my dataset.
Can anyone point me in the right direction?
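One possible direction, offered as a sketch rather than a definitive answer. Iterating over `df.sum(axis=0)` yields the column sums themselves, not the column labels, so the loop in the question would key the dictionary by the totals. It also recomputes the grand total on every iteration, which may well be what makes it so slow on a wide table. Computing the totals once and dividing the whole Series avoids both issues. The helper name `column_probabilities` and the variable `sample` are hypothetical, and the sketch assumes every column is numeric.

```python
import pandas as pd


def column_probabilities(df: pd.DataFrame) -> dict:
    """Return {column label: column sum / sum of the whole table}."""
    column_totals = df.sum(axis=0)      # one total per feature, computed once
    grand_total = column_totals.sum()   # total over the entire table
    # Dividing a Series by a scalar is vectorized; to_dict() keeps the
    # column labels as the dictionary keys.
    return (column_totals / grand_total).to_dict()


# Usage on the toy data from the question.
sample = pd.DataFrame(
    {"Word1": [0, 3, 9, 1, 2, 0], "Word2": [1, 2, 0, 6, 3, 2]},
    index=["pg1", "pg2", "pg3", "pg4", "pg5", "pg6"],
)
probs = column_probabilities(sample)
print({word: round(float(p), 3) for word, p in probs.items()})
# {'Word1': 0.517, 'Word2': 0.483}
```

Avoiding the name `dict` for the result also keeps the built-in type usable later in the session.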