I have a corpus for which I calculate the frequencies of unigrams and skipgrams, normalize the values by dividing each by the sum of all frequencies, and load them into pandas data frames. Now I would like to calculate the pointwise mutual information (PMI) of each skipgram, which is the log of the normalized skipgram frequency divided by the product of the normalized frequencies of the two unigrams in the skipgram.
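Concretely, for a skipgram (x, y) with normalized frequency p(x, y) and unigram probabilities p(x) and p(y), the calculation looks like this (the probability values here are illustrative, not taken from my corpus):

```python
import math

p_xy = 0.000002  # normalized skipgram frequency (hypothetical value)
p_x = 0.000007   # normalized frequency of the first unigram (hypothetical)
p_y = 0.000593   # normalized frequency of the second unigram (hypothetical)

# PMI = log10( p(x, y) / (p(x) * p(y)) )
pmi = math.log10(p_xy / (p_x * p_y))
```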
My data frames look like this:
unigram_df.head()
word count prob
0 nordisk 1 0.000007
1 lments 1 0.000007
2 four 91 0.000593
3 travaux 1 0.000007
4 cancerestimated 1 0.000007
skipgram_df.head()
words count prob
0 (o, odds) 1 0.000002
1 (reported, pretreatment) 1 0.000002
2 (diagnosis, simply) 1 0.000002
3 (compared, sbx) 1 0.000002
4 (imaging, or) 1 0.000002
For now, I calculate the PMI value of each skipgram by iterating through each row of skipgram_df, extracting the prob value of the skipgram, extracting the prob values of both unigrams, calculating the log, and appending the result to a list.
The code looks like this, and it works fine:
import math

pmi_list = []
for row in skipgram_df.itertuples():
    # row[1] is the (x, y) tuple of words, row[3] is the skipgram prob
    skipgram_prob = float(row[3])
    x_unigram_prob = float(unigram_df.loc[unigram_df['word'] == str(row[1][0])]['prob'])
    y_unigram_prob = float(unigram_df.loc[unigram_df['word'] == str(row[1][1])]['prob'])
    pmi = math.log10(skipgram_prob / (x_unigram_prob * y_unigram_prob))
    pmi_list.append(pmi)
The problem is that it takes a long time to iterate through the whole data frame (around 30 minutes for 300,000 skipgrams). I will have to work with corpora that are 10-20 times bigger than that, so I am looking for a more efficient way to do this. Can anyone suggest a quicker solution? Thank you.
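One direction I am considering is replacing the per-row lookups with vectorized pandas operations. A minimal sketch, assuming the column names shown above (the tiny example data here is made up): split the skipgram word tuples into two columns, map each word to its unigram probability via an indexed lookup, and compute all the logs in one call.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real data frames (hypothetical values)
unigram_df = pd.DataFrame({'word': ['a', 'b', 'c'],
                           'prob': [0.5, 0.3, 0.2]})
skipgram_df = pd.DataFrame({'words': [('a', 'b'), ('b', 'c')],
                            'prob': [0.2, 0.1]})

# Build a word -> prob lookup once, instead of scanning unigram_df per row
prob_map = unigram_df.set_index('word')['prob']

# .str[i] extracts the i-th element of each tuple in the 'words' column
x_prob = skipgram_df['words'].str[0].map(prob_map).to_numpy()
y_prob = skipgram_df['words'].str[1].map(prob_map).to_numpy()

# One vectorized log over the whole column
skipgram_df['pmi'] = np.log10(skipgram_df['prob'].to_numpy() / (x_prob * y_prob))
```

The idea is that `unigram_df['word'] == str(...)` in the loop scans the entire unigram frame once per skipgram, whereas an index-based `map` is a hash lookup per word, and the division and log run over whole arrays at once.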