Efficient calculation of pointwise mutual information in a text corpus in Python

Question

I have a corpus for which I compute the frequencies of unigrams and skipgrams, normalize them by dividing by the sum of all frequencies, and store the results in pandas data frames. Now I would like to calculate the pointwise mutual information (PMI) of each skipgram, which is the log of the skipgram's normalized frequency divided by the product of the normalized frequencies of its two unigrams.
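Written as a formula, the definition above would look roughly like this. The symbols are introduced here only for illustration: p(x, y) is the normalized skipgram frequency, p(x) and p(y) are the normalized frequencies of its two unigrams, and the base-10 log matches the loop shown later in the question.

```latex
\mathrm{PMI}(x, y) = \log_{10} \frac{p(x, y)}{p(x)\, p(y)}
```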

My data frames look like this:

unigram_df.head()
              word  count      prob
0          nordisk      1  0.000007
1           lments      1  0.000007
2             four     91  0.000593
3          travaux      1  0.000007
4  cancerestimated      1  0.000007

skipgram_df.head()
                      words  count      prob
0                 (o, odds)      1  0.000002
1  (reported, pretreatment)      1  0.000002
2       (diagnosis, simply)      1  0.000002
3           (compared, sbx)      1  0.000002
4             (imaging, or)      1  0.000002

For now, I calculate the PMI of each skipgram by iterating over the rows of skipgram_df: I extract the skipgram's prob value, look up the prob values of both unigrams, take the log, and append the result to a list.

The code looks like this, and it works fine:

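import math

pmi_list = []
for row in skipgram_df.itertuples():
    # row[0] is the index, row[1] the (x, y) word tuple, row[3] the skipgram prob
    skipgram_prob = float(row[3])
    # Each lookup below filters the whole unigram_df for a single word
    x_unigram_prob = float(unigram_df.loc[unigram_df['word'] == str(row[1][0])]['prob'])
    y_unigram_prob = float(unigram_df.loc[unigram_df['word'] == str(row[1][1])]['prob'])
    pmi = math.log10(skipgram_prob/(x_unigram_prob*y_unigram_prob))
    pmi_list.append(pmi)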

The problem is that it takes a long time to iterate through the whole data frame (around 30 minutes for 300,000 skipgrams). I will have to work on corpora that are 10-20 times bigger than that, so I am looking for a more efficient way to do this. Can anyone suggest a quicker solution? Thank you.

Are `skipgram_df['words']` strings or tuples? – wwii, Nov 11 '17 at 16:48

score 0 · Answer 1 · answered Jan 22 '19 at 01:51

I am also trying to solve something similar. I do not know how to improve the performance of the code itself, but you could parallelize it, because each calculation is independent of the others. See the related question "Pandas df.iterrow() parallelization".
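Since each row needs only its own prob value and two lookups into a fixed unigram table, one possible sketch replaces the per-row filtering with a dictionary lookup and a single array operation. It has not been run against the original corpus. The function name add_pmi and the toy data frames are hypothetical, and it assumes that words holds 2-tuples and that every word occurs exactly once in unigram_df.

```python
import numpy as np
import pandas as pd


def add_pmi(skipgram_df, unigram_df):
    """Return a copy of skipgram_df with a log10 PMI column added."""
    # Build the word -> probability lookup once instead of filtering
    # unigram_df for every row.
    lookup = dict(zip(unigram_df["word"], unigram_df["prob"]))

    # Assumption: every entry in 'words' is an (x, y) tuple and both
    # words are present in unigram_df (a missing word raises KeyError).
    p_x = np.array([lookup[x] for x, _ in skipgram_df["words"]])
    p_y = np.array([lookup[y] for _, y in skipgram_df["words"]])

    result = skipgram_df.copy()
    # Element-wise over whole columns; same formula as the row loop.
    result["pmi"] = np.log10(result["prob"].to_numpy() / (p_x * p_y))
    return result


# Toy data for illustration only (not taken from the original corpus).
unigram_df = pd.DataFrame({"word": ["a", "b", "c"], "prob": [0.5, 0.25, 0.25]})
skipgram_df = pd.DataFrame({"words": [("a", "b"), ("b", "c")], "prob": [0.25, 0.125]})
pmi_df = add_pmi(skipgram_df, unigram_df)
```

Because the rows stay independent, the same function could also be applied to separate chunks of skipgram_df if the data does not fit comfortably in one pass.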

ivangtorre
