
I am new to natural language processing, and I found this interesting tutorial that describes how to do topic modeling.

Available data for this tutorial

Source code: here

The above code performs topic modeling with LDA and generates k topics. My question is: how can I find which document belongs to which topic (cluster), like the example shown in the figure here? I am looking for something like:

publish_date:20030219 with text (aba ...) belongs to topic 1 cluster or ..

I have already read posts such as [1] and [2], but I still couldn't find my answer.

I also tried the MATLAB Text Analytics Toolbox, but I couldn't figure it out there either.

Any help would be appreciated.

Bilgin

1 Answer

You can pass your documents through the model like this:

a = lda_model[bow_corpus[:]]

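Create one list per topic and fill it with each document's score for that topic:

topic_0 = []
topic_1 = []
topic_2 = []

for doc_topics in a:
    # gensim omits topics whose probability is below the model's threshold,
    # so look scores up by topic id instead of by position in the list
    scores = dict(doc_topics)
    topic_0.append(scores.get(0, 0.0))
    topic_1.append(scores.get(1, 0.0))
    topic_2.append(scores.get(2, 0.0))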

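Then write the scores to a CSV file. The column with the largest value in each row is that document's topic:

import pandas as pd

d = {'topic_0': topic_0,
     'topic_1': topic_1,
     'topic_2': topic_2}

df = pd.DataFrame(data=d)
# mode='a' appends to an existing file; the index column is the document number
df.to_csv("YourCSV.csv", index=True, mode='a')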

You can also look at the scores for a single row:

lda_model[bow_corpus[123]]
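
For a single document the result is a list of (topic_id, probability) pairs, and gensim leaves out topics whose probability falls below the model's minimum_probability threshold, so the list can be shorter than the number of topics. Below is a minimal sketch of turning such a list into a fixed-length row. The helper name `to_dense` and the example scores are made up for illustration and are not real model output.

```python
def to_dense(doc_topics, num_topics):
    """Expand sparse (topic_id, probability) pairs into a fixed-length list."""
    row = [0.0] * num_topics
    for topic_id, prob in doc_topics:
        row[topic_id] = float(prob)
    return row

# Hypothetical output for one document: topic 1 was filtered out as too small.
example = [(0, 0.12), (2, 0.88)]
dense_row = to_dense(example, num_topics=3)
print(dense_row)
```

With a real model, calling lda_model.get_document_topics(bow, minimum_probability=0) should return a score for every topic, which avoids the gaps in the first place.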

I hope this helps :)

Sara
  • Thank you for your comment and code. It helped me understand more about LDA. I would like to know if there is a way to save only the highest-probability topic for each document, e.g. document 1 belongs to topic 2 with a probability of 0.97. Thank you – Bilgin May 23 '19 at 14:09
  • You can write a formula like: if max(topic_0, topic_1, topic_2) = topic_1 then 'topic_1' elseif max(topic_0, topic_1, topic_2) = topic_2 then 'topic_2' else 'topic_0' endif – Sara May 24 '19 at 16:27
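
The "keep only the highest-probability topic" idea from the comments can be sketched in plain Python. This assumes each document's scores are a list of (topic_id, probability) pairs, as gensim returns them. The function name `dominant_topic` and the sample scores are hypothetical.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])

# Hypothetical per-document scores shaped like gensim's output.
all_docs = [
    [(0, 0.02), (1, 0.01), (2, 0.97)],
    [(0, 0.60), (1, 0.40)],
]

# One (topic_id, probability) assignment per document, in corpus order.
assignments = [dominant_topic(doc) for doc in all_docs]
for doc_id, (topic_id, prob) in enumerate(assignments):
    print(f"document {doc_id} -> topic {topic_id} (p={prob:.2f})")
```

In practice `all_docs` would be replaced by the model output (for example `lda_model[bow_corpus]`), and the resulting pairs could be written out next to each document's publish_date.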