I have a large DataFrame of text, and I want to first train an LDA model on it. So I do:

from gensim import corpora
from gensim.models import LdaMulticore

doc_clean = df['tweet_tokenized'].tolist()
dictionary = corpora.Dictionary(doc_clean)
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
lda = LdaMulticore(doc_term_matrix, id2word=dictionary, num_topics=50)

Now that I have my trained lda, I want to iterate through df row by row and write the probability of each row belonging to a given topic into the corresponding column. So, first I create 50 columns of zeros:

for i in range(50):
    col_name = 'tweet_topic_'+str(i)
    df[col_name] = 0

Then I iterate through the rows using iterrows() and update the values using the at method:

for row_index, row in df.iterrows():
    new_doc = dictionary.doc2bow(row['tweet_tokenized'])
    lda_result = lda[new_doc]
    for topic in lda_result:
        col_name = 'tweet_topic_'+(str(topic[0]))
        df.at[row_index,col_name] = topic[1]

But it doesn't work properly: the values of the above 50 columns don't change and remain zeros.

Any idea how I should resolve this?

UPDATE: I added row = row.copy() and replaced at with loc and it works well now.

So here is the working code:

for row_index, row in df.iterrows():
    row = row.copy()
    new_doc = dictionary.doc2bow(row['tweet_tokenized'])
    lda_result = lda[new_doc]
    for topic in lda_result:
        col_name = 'tweet_topic_'+(str(topic[0]))
        df.loc[row_index,col_name] = topic[1]
msmazh
  • Can you clarify what you mean by "it doesn't work properly?" – Evan Dec 03 '18 at 20:22
  • What do the values for `'tweet_topic_'+(str(topic[0]))` look like if you print them out? – Evan Dec 03 '18 at 20:29
  • @Evan by not working properly I mean the columns don't get updated. All values remain zero, as initially set. – msmazh Dec 03 '18 at 20:30
  • @Evan I did the print('tweet_topic_'+str(topic[0])) and it works well. It'll give: tweet_topic_1, tweet_topic_2, tweet_topic_3, etc. – msmazh Dec 03 '18 at 20:32
  • Can you post or link to some sample data? Are there 50 topics in each `lda_result`? – Evan Dec 03 '18 at 21:16
  • @Evan lda_result will be a list of a few tuples (mostly one tuple). For example, it'll be: [(1, 0.45), (4, 0.37)], meaning the text in this specific row belongs to topic 1 with probability 0.45 and to topic 4 with probability 0.37. – msmazh Dec 03 '18 at 21:21
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/184649/discussion-between-evan-and-msmazh). – Evan Dec 03 '18 at 21:35
  • @Evan I resolved the issue. Please see the update in the post. Many thanks. – msmazh Dec 03 '18 at 21:50

1 Answer

Using the instructions in the following post, I was able to resolve it:

Updating value in iterrow for pandas

for row_index, row in df.iterrows():
    row = row.copy()
    new_doc = dictionary.doc2bow(row['tweet_tokenized'])
    lda_result = lda[new_doc]
    for topic in lda_result:
        col_name = 'tweet_topic_'+(str(topic[0]))
        df.loc[row_index,col_name] = topic[1]
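
As an alternative to updating cells one at a time, the sparse lda_result list could be expanded into a dense row of floats first. This is only a minimal sketch: topics_to_dense is a hypothetical helper name, and the sample input is the example value from the comments, not real model output. It also starts from float zeros, which sidesteps one possible (unverified) cause of the original problem, namely that columns created with = 0 have an integer dtype.

```python
from typing import List, Sequence, Tuple


def topics_to_dense(lda_result: Sequence[Tuple[int, float]], num_topics: int) -> List[float]:
    """Expand a sparse [(topic_id, probability), ...] list into a dense row."""
    # Start from float zeros so probabilities are stored as floats.
    row = [0.0] * num_topics
    for topic_id, prob in lda_result:
        row[topic_id] = float(prob)
    return row


# Sample value from the comments; num_topics=5 is only for this demo
# (the question uses 50 topics).
dense = topics_to_dense([(1, 0.45), (4, 0.37)], num_topics=5)
print(dense)  # [0.0, 0.45, 0.0, 0.0, 0.37]
```

With the question's objects, the rows could presumably be collected as [topics_to_dense(lda[dictionary.doc2bow(doc)], 50) for doc in doc_clean], wrapped in pd.DataFrame(rows, index=df.index), and joined to df. That last step is untested here.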
msmazh