
I have 13 different lists of words. As I am doing topic modelling, I want to clean each list, create a corpus, run get_document_topics, and concatenate the results of all the lists. The code for doing this process over one list, i.e. eastern_data_words, is shown below. I want to apply the same steps to the remaining 12 lists. I believe I should create a dictionary of these lists and then somehow loop over it.

# Remove Stop Words

eastern_data_words_nostops = remove_stopwords(eastern_data_words)

# Form Bigrams

eastern_data_words_bigrams = make_bigrams(eastern_data_words_nostops)


nlp = spacy.load("en_core_web_md", disable=['parser', 'ner'])

# Do lemmatization keeping only noun, adj, vb, adv

eastern_data_lemmatized = lemmatization(eastern_data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

# Create Dictionary

id2word_reg = corpora.Dictionary(eastern_data_lemmatized)

# Create Corpus

texts_reg = eastern_data_lemmatized

# Term Document Frequency

corpus_reg = [id2word_reg.doc2bow(text) for text in texts_reg]

# Getting weights of topics in all the documents

topics = [lda_model_tuned[corpus_reg[i]] for i in range(len(corpus_reg))]

def topics_document_to_dataframe(topics_document, num_topics):
    # Turn one document's (topic_id, weight) pairs into a one-row DataFrame
    res = pd.DataFrame(columns=range(num_topics))
    for topic_id, weight in topics_document:
        res.loc[0, topic_id] = weight
    return res

document_topic = pd.concat([topics_document_to_dataframe(topics_document, num_topics=8)
                            for topics_document in topics]).reset_index(drop=True).fillna(0)

eastern_weights = document_topic.apply(np.mean, axis=0)

At the end I want a dataframe with the weights of the different topics as columns and the list names as rows. An example of one column is shown in the image.
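For illustration, the dictionary of lists mentioned above might look like this (the name named_lists and every regional variable other than eastern_data_words are hypothetical):

named_lists = {
    'eastern': eastern_data_words,
    'western': western_data_words,
    # ... the remaining 11 regional lists
}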

  • I'm not completely following the problem you're trying to solve. Are you asking how to combine the 12 lists into one list so you can iterate over all of them? What form is the data in currently? When you say you have 12 lists of data, are you saying you have something like 12 distinct text files? Excel Docs? Dataframes? – Hofbr Aug 26 '20 at 00:22
  • The lists are separate text files. They are distinct as they are tweets from different regions; each region's tweets are in the form of a separate list, so they need to be analyzed separately. However, the code for analyzing them is the same, so it gets very repetitive. I am just trying to use a loop to solve it somehow. – Sannia Nasir Aug 26 '20 at 00:30
  • In that case, you should be able to just iterate through the text files and run the process you've outlined above on each text file. – Hofbr Aug 26 '20 at 00:35

2 Answers


If the starting input is one of the 13 lists of words followed by all the steps you outline, you should be able to turn all those steps into a function that accepts a list of words as input. Then you can just loop over the function for each list.

pseudocode:

def func(list_of_words):
    # all your steps here: remove stopwords, form bigrams, lemmatize,
    # build the dictionary and corpus, get the topic weights
    return processed_list

word_lists = [list_1, list_2, ... , list_13]
results = []
for l in word_lists:
    results.append(func(l))
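
A concrete version of that sketch, assuming the helpers (remove_stopwords, make_bigrams, lemmatization, topics_document_to_dataframe) and lda_model_tuned from the question are all in scope; process_list and all_weights are names made up here, and named_lists is the hypothetical dictionary of lists sketched in the question:

def process_list(data_words):
    # the same per-list steps from the question, wrapped for reuse
    words_nostops = remove_stopwords(data_words)
    words_bigrams = make_bigrams(words_nostops)
    lemmatized = lemmatization(words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
    id2word = corpora.Dictionary(lemmatized)
    corpus = [id2word.doc2bow(text) for text in lemmatized]
    topics = [lda_model_tuned[bow] for bow in corpus]
    document_topic = pd.concat([topics_document_to_dataframe(t, num_topics=8)
                                for t in topics]).reset_index(drop=True).fillna(0)
    return document_topic.apply(np.mean, axis=0)  # average weight of each topic

# rows = list names, columns = topic weights
all_weights = pd.DataFrame({name: process_list(words)
                            for name, words in named_lists.items()}).T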
– footfalcon

Based on your response to my comment, it sounds like you're really looking for how to iterate over files.

This Stack Overflow question is a good place to start, I think: How can I iterate over files in a given directory?

In order for us to provide more direction, you should include the process you're currently using to import the data. The code above is more about the processing of it, which it sounds like you already have figured out.
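
For example, something along these lines (the folder name tweets/ and the one-tweet-per-line format are assumptions about how your files are laid out):

from pathlib import Path

region_words = {}
for path in Path('tweets').glob('*.txt'):  # assumed folder of per-region text files
    with open(path, encoding='utf-8') as f:
        # assumed format: one tweet per line, whitespace-tokenized
        region_words[path.stem] = [line.split() for line in f]

# region_words now maps each region name to its list of tokenized tweets,
# ready to feed into the processing pipeline from the question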

– Hofbr