I have multiple text columns. I want to use bag of words for each text column, then create a new bag of words dataframe for each text column. This is what I have:

import nltk
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

text_df = [['text response', 'another response'], ['written responses', 'more text'], ['lots more text', 'text text']]
text_df = pd.DataFrame(text_df, columns=['answer1', 'answer2'])

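def bow(tokens, data):
    # Split each response into a list of word tokens
    tokens = tokens.apply(nltk.word_tokenize)
    # The identity tokenizer tells CountVectorizer the input is already tokenized
    cvec = CountVectorizer(min_df=.01, ngram_range=(1,3), tokenizer=lambda doc:doc, lowercase=False)
    cvec.fit(tokens)
    cvec_counts = cvec.transform(tokens)
    cvec_counts_bow = cvec_counts.toarray()
    # get_feature_names() was removed in scikit-learn 1.2
    vocab = cvec.get_feature_names_out()
    bow_model = pd.DataFrame(cvec_counts_bow, columns=vocab)
    return bow_model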

answer_list = ['answer1', 'answer2']

for a in answer_list:
    a = bow(text_df[a], a)

I want 2 dataframes, one called answer1 and one called answer2, each with its own bag of words. Instead, I end up with one dataframe called "a" that holds only the bag of words for answer2.

Any ideas how to fix this?

Kim S.

1 Answer

Please trace your code properly. You did get two data frames, but you discarded all but the last one: each pass through the loop rebinds a, so the previous result is lost. You need to save them all (both):

frame_list = [bow(text_df[a], a) for a in answer_list]

Also, please note that you used a very dangerous practice: you overwrote your loop index, a, with a different value while inside the loop.

If you do want the loop format, use a different variable and save the results:

frame_list = []
for answer in answer_list:
    frame_list.append(bow(text_df[answer], answer))

Here, I use answer to iterate through the list, but the variable I change is frame_list.
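
To make the difference concrete, here is a minimal sketch that needs no pandas or scikit-learn. The function make_result is a hypothetical stand-in for bow; it is not part of the original code and only exists to show what happens to the results in each loop style.

```python
# Hypothetical stand-in for bow(): returns a new object for each column name.
def make_result(name):
    return name.upper()

answer_list = ["answer1", "answer2"]

# Rebinding the loop variable: each pass overwrites `a`,
# so only the result of the last iteration is still reachable afterwards.
for a in answer_list:
    a = make_result(a)

# Appending to a list keeps every result, in the same order as answer_list.
results = []
for answer in answer_list:
    results.append(make_result(answer))

print(a)        # ANSWER2
print(results)  # ['ANSWER1', 'ANSWER2']
```

With the real bow function, results[0] would be the data frame for answer1 and results[1] the one for answer2.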


Update per OP comment:

See "How to create variable variables". A program that dynamically modifies its own namespace is dangerous, and doing so usually serves no design purpose. Instead, either create a list of data (as I did in my solution) or, if your generated names do have some significance externally, treat those labels as data and use a dict:

frame_table = {}
for idx, answer in enumerate(answer_list):
    frame_table["answer" + str(idx+1)] = bow(text_df[answer], answer)

This will give you two dict entries, answer1 and answer2.
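
As a rough illustration of how such a dict is used afterwards, the sketch below keys the results by column name and looks them up by label. It is a simplification under stated assumptions: columns is a hypothetical plain-dict stand-in for text_df, and count_words is a hypothetical stand-in for bow that only counts single words with collections.Counter instead of building a data frame.

```python
from collections import Counter

# Hypothetical stand-in for text_df: column name -> list of responses.
columns = {
    "answer1": ["text response", "written responses", "lots more text"],
    "answer2": ["another response", "more text", "text text"],
}

def count_words(responses):
    """Simplified stand-in for bow(): unigram counts summed over one column."""
    counts = Counter()
    for response in responses:
        counts.update(response.split())
    return counts

# The column names are treated as data (dict keys), not as variable names.
frame_table = {name: count_words(responses) for name, responses in columns.items()}

print(frame_table["answer1"]["text"])  # 2
print(sorted(frame_table))             # ['answer1', 'answer2']
```

Looking a result up by its label (frame_table["answer1"]) replaces the need for a variable literally named answer1, and iterating over frame_table.items() processes every column without knowing the names in advance.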

Prune