I have a large dataframe (~500,000 rows). Processing each row gives me a Counter object (a dictionary of object counts). The output I want is a new dataframe whose column headers are the objects being counted (the keys of the dictionary). I am looping over the rows, but it takes a very long time. I know that loops should be avoided in pandas. Any suggestions?
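from collections import Counter
import nltk
import pandas as pd

out_df = pd.DataFrame()
for row in input_df['text']:
    tokens = nltk.word_tokenize(row)
    pos = nltk.pos_tag(tokens)
    # count the POS tag (second element) of each (token, tag) pair
    count = Counter(elem[1] for elem in pos)
    # appending one row at a time is the slow part
    out_df = out_df.append(count, ignore_index=True)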
For illustration, Counter(elem[1] for elem in pos) looks like Counter({'NN': 8, 'VBZ': 2, 'DT': 3, 'IN': 4}).
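One direction that might help is to collect the Counters in a plain list and build the dataframe once at the end, rather than appending a row on every iteration. The following is only a minimal sketch under assumptions: `count_tags` and `counters_to_frame` are hypothetical helper names, and `tagged_rows` is made-up data standing in for the output of `nltk.pos_tag(nltk.word_tokenize(row))`, so the example runs without NLTK.

```python
from collections import Counter

import pandas as pd


def count_tags(tagged):
    """Count the tag (second element) of each (token, tag) pair."""
    return Counter(tag for _, tag in tagged)


def counters_to_frame(counters):
    """Build one DataFrame from many Counters in a single constructor call."""
    # Each Counter becomes one row; tags absent from a row come out as NaN,
    # so fill them with 0 and cast the counts back to integers.
    return pd.DataFrame([dict(c) for c in counters]).fillna(0).astype(int)


# Hypothetical stand-in for the tagged output of each row of input_df['text']
tagged_rows = [
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("in", "IN"), ("the", "DT"), ("park", "NN")],
]

out_df = counters_to_frame(count_tags(row) for row in tagged_rows)
```

This only removes the cost of growing the dataframe row by row. The tokenizing and tagging still run once per row, so they may remain the dominant cost.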