Generating text corpus from a matrix, based on words and their weighted probabilities

Question

I have a matrix, and I am trying to generate a text corpus from it.

           chewbacca  darth  han  leia  luke   obi
chewbacca          0      0    0     0  0.66  0.33
darth              0      0    0     1     0     0
han                0      0    0     0     1     0
leia               0      0    0     0     1     0
luke               0      0    0     0     0     0
obi                0      0    0     0     0     0

I selected the word chewbacca as my first word.

Now I am trying to pick the word that follows chewbacca, based on the probabilities in its row. There are two candidates: luke (0.66) and obi (0.33).

The second word must be chosen according to these weighted probabilities.

For instance, if "luke" pairs with "chewbacca" at 0.66 and "obi" pairs with "chewbacca" at 0.33, "luke" must be twice as likely to be selected as "obi".

How should I approach this? I'd appreciate any tips!

@Yoben, thanks for your response. I am very new to NLP, so I am trying to understand the best way to do it :(. My further goal is to generate words like this in a loop and build sentences out of them. I hope that somebody with NLP expertise can see this question and give me some tips. – Anakin Skywalker, Jul 24 '20 at 00:53
Follow-up question: how should we compare the 1 from another row with 0.66? – BENY, Jul 24 '20 at 00:55
@YOBEN_S, as far as I understand so far, the probabilities are defined per ROW, so each row sums to 1 on its own. Hence, the 1 is not very relevant here. If I select "leia", then 1 will be the probability of "luke". – Anakin Skywalker, Jul 24 '20 at 00:57
Yes, these are bigrams. I created a list of bigrams with the frequency of each one. Then I normalized the counts and turned them into probabilities. Now I am trying to understand how to generate the text corpus, based on my ideas above. – Anakin Skywalker, Jul 24 '20 at 02:11

Ehsan · Accepted Answer · 2020-07-24T02:54:43.187

If you want to create a corpus of bigrams:

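import numpy as np

# df is the probability matrix from the question (rows: first word, columns: second word)
# remove rows that sum to 0: those words have no successor to sample
df = df.loc[df.sum(axis=1) != 0]
# normalize each row so its probabilities sum to 1, as np.random.choice requires
df = df.div(df.sum(axis=1), axis=0).fillna(0)
# number of bigrams to generate for each row; this could also vary by row
num_bigrams_per_word = 3
# for each row, sample second words weighted by that row's probabilities
df['bigrams'] = df.apply(lambda x: [x.name + ' ' + s for s in np.random.choice(df.columns, p=x.values, size=num_bigrams_per_word)], axis=1)
# concatenate the per-row lists into one flat list of bigrams
corpus = df.bigrams.sum()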

Example output:

['chewbacca obi', 'chewbacca obi', 'chewbacca luke', 'darth leia', 'darth leia', 'darth leia', 'han luke', 'han luke', 'han luke', 'leia luke', 'leia luke', 'leia luke']
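
If the goal is instead to generate running text, as the comments on the question describe, one possible approach is to treat the matrix as a first-order Markov chain: start from a word, sample the next word from that word's row, and repeat from the new word. The following is only a minimal sketch under stated assumptions, not an established NLP method. The function name generate_sentence, the max_words cap, and the rule of stopping at an all-zero row are all illustrative choices that do not appear in the question.

```python
import numpy as np
import pandas as pd

words = ["chewbacca", "darth", "han", "leia", "luke", "obi"]
# Matrix from the question: rows are the current word, columns the next word.
probs = pd.DataFrame(
    [
        [0, 0, 0, 0, 0.66, 0.33],
        [0, 0, 0, 1, 0, 0],
        [0, 0, 0, 0, 1, 0],
        [0, 0, 0, 0, 1, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
    ],
    index=words,
    columns=words,
    dtype=float,
)


def generate_sentence(probs, start, max_words=10, rng=None):
    """Walk the matrix from `start`, sampling each next word by its row weights.

    Hypothetical helper: `max_words` and the stop rule are assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    sentence = [start]
    current = start
    while len(sentence) < max_words:
        row = probs.loc[current].to_numpy(dtype=float)
        total = row.sum()
        if total == 0:
            # No outgoing bigrams (here "luke" and "obi"): end the sentence.
            break
        # Renormalize so the weights sum to exactly 1 (0.66 + 0.33 does not).
        current = rng.choice(probs.columns.to_numpy(), p=row / total)
        sentence.append(current)
    return " ".join(sentence)


# Seeded generator so repeated runs draw the same samples.
rng = np.random.default_rng(0)
sentences = [generate_sentence(probs, "chewbacca", rng=rng) for _ in range(1000)]
```

With this matrix every sentence that starts at "chewbacca" ends after two words, because "luke" and "obi" have no successors, and "chewbacca luke" should appear roughly twice as often as "chewbacca obi". Starting from "darth" always gives "darth leia luke", since each of those rows has a single option. A larger matrix built from a real corpus would produce longer and more varied chains.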

edited Jul 24 '20 at 02:54

answered Jul 24 '20 at 02:39

@AnakinSkywalker Please let us know if you are looking for something different from a bigram corpus. – Ehsan Jul 24 '20 at 02:40
Ehsan, I appreciate your help! I am looking for something different. I have a corpus already. I assigned a value to each bigram, created a DataFrame from those values, and then turned them into probabilities. Now my goal is to generate text based on this matrix of probabilities, and I am trying to understand how to do it. – Anakin Skywalker Jul 24 '20 at 02:54
@AnakinSkywalker I think that question is out of scope for SO then. I assume there are many approaches to it. If you are trying to find a methodology for generating text from bigrams alone, you would need to search for one. SO is meant to help you code what you already know you want. – Ehsan Jul 24 '20 at 02:57
Yeah, I understand. I was hoping that somebody with NLP experience would stop by and help me. – Anakin Skywalker Jul 24 '20 at 02:59
