
I have a structured dataset with columns 'text' and 'topic'. Someone has already run word-embedding/topic-modeling, so each row in 'text' is assigned a topic number (1-200). I would like to create a new data frame with the topic number and the top 5-10 keywords that represent that topic.

I've done this before, but I usually start from scratch and run an LDA model. Then use the objects created by the LDA to find keywords per topic. That said, I'm starting from a mid-point that my supervisor gave me, and it's throwing me off.

The data structure looks like below:

import pandas as pd
df = pd.DataFrame({'text': ['foo bar baz', 'blah bling', 'foo'], 
               'topic': [1, 2, 1]})

So would the plan be to create a bag of words, groupby 'topic', and count the words? Or is there a keywords function plus a group-by-column option that I don't know about in gensim or nltk?
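To make the idea concrete, something like this (plain pandas plus collections, no gensim) is what I'm picturing:

```python
import pandas as pd
from collections import Counter

df = pd.DataFrame({'text': ['foo bar baz', 'blah bling', 'foo'],
                   'topic': [1, 2, 1]})

# Concatenate all documents belonging to each topic, then count words
joined = df.groupby('topic')['text'].agg(' '.join)
counts = {topic: Counter(text.split()) for topic, text in joined.items()}
```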

abombz
  • You can use print_topic() or print_topics() method from the gensim package. An example is given [here](https://stackoverflow.com/questions/15016025/how-to-print-the-lda-topics-models-from-gensim-python). – vb_rises Jun 27 '19 at 12:49
  • @Vishal Those all assume that I have already run the LDA on my computer. I have not, I only have the topics that were given to me. – abombz Jun 27 '19 at 12:58
  • Ok. Then you need to combine the words by topic and create a dictionary of word counters. – vb_rises Jun 27 '19 at 13:06
  • @Vishal Ok. Do you have a link/tutorial for something like that. Thank you for your help. – abombz Jun 27 '19 at 13:11
  • Check my answer. I am not sure if this is your requirement. – vb_rises Jun 27 '19 at 14:39

2 Answers


I have created a dictionary whose keys are the topic numbers and whose values are strings built by appending each topic's words.

d = dict()
for _, row in df.iterrows():
    topic = row['topic']
    if topic not in d:
        d[topic] = ""
    d[topic] += row['text'] + " "

print(d)
#Output
{1: 'foo bar baz foo ', 2: 'blah bling '}

Then I used the Counter class from the collections module to get the word frequencies for each topic.

from collections import Counter
for key in d.keys():
    print(Counter(d[key].split()))

#Output
Counter({'foo': 2, 'baz': 1, 'bar': 1})
Counter({'blah': 1, 'bling': 1})
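To get from these counters to the DataFrame the question asks for (topic number plus its top words), most_common() can be used; a sketch, assuming the top 5 words are wanted:

```python
import pandas as pd
from collections import Counter

# Dictionary produced by the step above
d = {1: 'foo bar baz foo ', 2: 'blah bling '}

# One row per topic: topic number plus its top-5 words by frequency
rows = [{'topic': topic,
         'keywords': [word for word, _ in Counter(text.split()).most_common(5)]}
        for topic, text in d.items()]
keywords_df = pd.DataFrame(rows)
```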
vb_rises

I think this works:

test = pd.DataFrame(df.groupby("topic")['text'].apply(' '.join))

from rake_nltk import Rake, Metric

r = Rake(ranking_metric=Metric.DEGREE_TO_FREQUENCY_RATIO, language='english', min_length=1, max_length=4)

r.extract_keywords_from_text(test.text[180])
r.get_ranked_phrases()

I just need to figure out how to loop through each topic and append the results to a dataframe.
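One possible shape for that loop, sketched with a plain frequency ranking standing in for the Rake calls so the example is self-contained (the Rake lines would go where the comment indicates):

```python
import pandas as pd
from collections import Counter

df = pd.DataFrame({'text': ['foo bar baz', 'blah bling', 'foo'],
                   'topic': [1, 2, 1]})

rows = []
for topic, texts in df.groupby('topic')['text']:
    joined = ' '.join(texts)
    # With rake_nltk this would instead be:
    #   r.extract_keywords_from_text(joined)
    #   phrases = r.get_ranked_phrases()
    phrases = [w for w, _ in Counter(joined.split()).most_common(10)]
    rows.append({'topic': topic, 'keywords': phrases})

result = pd.DataFrame(rows)
```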

abombz