
I am very new to Python. I have a list of tuples in which each tuple is a bigram.

This question is pretty close to my needs

my_list = [('we', 'consider'), ('what', 'to'), ('use', 'the'), ('words', 'of')]

Now I am trying to convert this into a frequency matrix.

The desired output is:

          consider  of  the  to  use  we  what  words
consider         0   0    0   0    0   0     0      0
of               0   0    0   0    0   0     0      0
the              0   0    0   0    0   0     0      0
to               0   0    0   0    0   0     0      0
use              0   0    1   0    0   0     0      0
we               1   0    0   0    0   0     0      0
what             0   0    0   1    0   0     0      0
words            0   1    0   0    0   0     0      0

How can I do this using numpy or pandas? Unfortunately, I can only find solutions that use nltk.

Anakin Skywalker

2 Answers


You can create a frequency DataFrame indexed by the words on both axes, then increment the cell for each bigram:

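import pandas as pd

# Vocabulary: every distinct word in the bigrams, sorted alphabetically
words = sorted({item for t in my_list for item in t})
# Square frame of zeros, labeled by word on both axes
df = pd.DataFrame(0, columns=words, index=words)
for first, second in my_list:
    df.at[first, second] += 1  # row = first word, column = second word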

output:

          consider  of  the  to  use  we  what  words
consider         0   0    0   0    0   0     0      0
of               0   0    0   0    0   0     0      0
the              0   0    0   0    0   0     0      0
to               0   0    0   0    0   0     0      0
use              0   0    1   0    0   0     0      0
we               1   0    0   0    0   0     0      0
what             0   0    0   1    0   0     0      0
words            0   1    0   0    0   0     0      0

Note that here the order within each bigram matters: ('we', 'consider') and ('consider', 'we') are counted in different cells. If you don't care about order, sort each tuple first:

my_list = [tuple(sorted(i)) for i in my_list]
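To make the effect of that sorting step concrete, here is a minimal sketch using only the standard library. The sample list `pairs` is hypothetical (it is not the asker's data) and contains one reversed pair on purpose:

```python
from collections import Counter

# Hypothetical sample with one reversed pair, to show what sorting changes
pairs = [('we', 'consider'), ('consider', 'we'), ('use', 'the')]

# Direction kept: the two orderings are separate keys
ordered = Counter(pairs)
# Direction ignored: both orderings collapse onto the sorted tuple
unordered = Counter(tuple(sorted(p)) for p in pairs)

print(ordered[('we', 'consider')])    # 1
print(unordered[('consider', 'we')])  # 2
```

After sorting, only one of the two mirrored cells can ever be non-zero, which is why the order-insensitive matrix ends up with all of its counts on one side of the diagonal.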

Another way is to use Counter to do the counting, though I expect similar performance. Again, if the order within bigrams matters, remove sorted from frequency_list:

from collections import Counter
import pandas as pd

# Count each bigram once; sorted() makes ('a', 'b') and ('b', 'a') the same key
frequency_list = Counter(tuple(sorted(i)) for i in my_list)
words = sorted({item for t in my_list for item in t})
df = pd.DataFrame(0, columns=words, index=words)
for (first, second), count in frequency_list.items():
    df.at[first, second] = count

output:

          consider  of  the  to  use  we  what  words
consider         0   0    0   0    0   1     0      0
of               0   0    0   0    0   0     0      1
the              0   0    0   0    1   0     0      0
to               0   0    0   0    0   0     1      0
use              0   0    0   0    0   0     0      0
we               0   0    0   0    0   0     0      0
what             0   0    0   0    0   0     0      0
words            0   0    0   0    0   0     0      0
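If you would rather avoid the explicit loop, one possible alternative (not part of the answer above, so treat it as a sketch) is `pd.crosstab`, which tabulates co-occurrences of two aligned sequences. The names `first`, `second`, and `words` are introduced here for illustration, and the order within each bigram is kept:

```python
import pandas as pd

my_list = [('we', 'consider'), ('what', 'to'), ('use', 'the'), ('words', 'of')]

# Split the bigrams into two aligned Series: first words and second words
first = pd.Series([a for a, _ in my_list])
second = pd.Series([b for _, b in my_list])

# Full vocabulary, so the result is square even if a word appears on one side only
words = sorted(set(first) | set(second))

# crosstab counts each (first, second) pair; reindex pads missing rows/columns with 0
df = (pd.crosstab(first, second)
        .reindex(index=words, columns=words, fill_value=0))

print(df)
```

Without the `reindex` step, `crosstab` only returns the rows and columns that actually occur, so the result would be 4 x 4 for this data rather than 8 x 8.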
Ehsan

If you do not care too much about speed, you could use a for loop.

import pandas as pd
import numpy as np
from itertools import product

my_list = [('we', 'consider'), ('what', 'to'), ('use', 'the'), ('words', 'of')]

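index = pd.DataFrame(my_list)[0].unique()    # distinct first words (rows)
columns = pd.DataFrame(my_list)[1].unique()  # distinct second words (columns)
# Shape is (rows, columns), i.e. (len(index), len(columns))
df = pd.DataFrame(np.zeros(shape=(len(index), len(columns))),
                  columns=columns, index=index, dtype=int)

for idx, col in product(index, columns):
    # .loc[row, column] avoids chained assignment, which may not update df
    df.loc[idx, col] = my_list.count((idx, col))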

print(df)

Output:

       consider  to  the  of
we            1   0    0   0
what          0   1    0   0
use           0   0    1   0
words         0   0    0   1
sszokoly
  • If you need an N x N sparse matrix, the accepted answer is better. If you want to keep the matrix as small as possible and it does not have to be symmetric, this approach gives you that. – sszokoly Jul 17 '20 at 06:24
  • My matrix is pretty big, 10000 * 10000, so not sure if loop is a good idea, but I will use your method with smaller matrices! Thanks! – Anakin Skywalker Jul 25 '20 at 23:25
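For a large vocabulary such as the 10000 * 10000 case mentioned in the last comment, calling `my_list.count` once per cell gets expensive. A minimal sketch of a vectorized count with numpy follows; it is not taken from either answer, and the helper names `pos`, `rows`, `cols`, and `counts` are made up for illustration:

```python
import numpy as np
import pandas as pd

my_list = [('we', 'consider'), ('what', 'to'), ('use', 'the'), ('words', 'of')]

# Map every distinct word to an integer position in the sorted vocabulary
words = sorted({w for pair in my_list for w in pair})
pos = {w: i for i, w in enumerate(words)}

# Row and column positions for each bigram
rows = np.array([pos[a] for a, _ in my_list])
cols = np.array([pos[b] for _, b in my_list])

# np.add.at accumulates correctly when the same (row, col) pair repeats,
# unlike counts[rows, cols] += 1, which would count a repeated pair only once
counts = np.zeros((len(words), len(words)), dtype=int)
np.add.at(counts, (rows, cols), 1)

df = pd.DataFrame(counts, index=words, columns=words)
print(df)
```

This makes a single pass over the bigrams instead of one pass per cell. Keep in mind that a dense 10000 x 10000 array of 8-byte integers needs roughly 800 MB, so a sparse representation may be worth considering at that size.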