
I have a large dataframe (~500,000 rows). Processing each row gives me a Counter object (a dictionary of object counts). The output I want is a new dataframe whose column headers are the objects being counted (the keys in the dictionary). I am looping over the rows, but it takes very long. I know that loops should be avoided in Pandas; any suggestions?

import nltk
import pandas as pd
from collections import Counter

out_df = pd.DataFrame()
for row in input_df['text']:
    tokens = nltk.word_tokenize(row)
    pos = nltk.pos_tag(tokens)
    count = Counter(elem[1] for elem in pos)
    out_df = out_df.append(count, ignore_index=True)

For reference, Counter(elem[1] for elem in pos) looks like Counter({'NN': 8, 'VBZ': 2, 'DT': 3, 'IN': 4}).
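
So the row for the counter above should appear in the output dataframe as:

   NN  VBZ  DT  IN
0   8    2   3   4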

Sophie

2 Answers


Using append on a dataframe is quite inefficient I believe: each call copies the entire dataframe into a new object, so building a frame row by row does quadratic work in total.

DataFrames were meant for analyzing data and easily adding columns—but not rows.

So I think a better approach would be to create a list first (appending to a list is cheap) and convert it to a dataframe at the end.

I'm not familiar with nltk so I can't actually test this, but something along the following lines should work:

out_data = []
for row in input_df['text']:
    tokens = nltk.word_tokenize(row)
    pos = nltk.pos_tag(tokens)
    count = Counter(elem[1] for elem in pos)
    out_data.append(count)  # appending to a list is cheap
out_df = pd.DataFrame(out_data)  # build the dataframe once at the end

You might want to add the following to replace any NaNs with zeros and convert the final counts to integers:

out_df = out_df.fillna(0).astype(int)
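
The NaNs appear because pandas fills in missing keys when the counters don't all share the same tags. A small illustration (hypothetical counters):

pd.DataFrame([Counter({'NN': 2}), Counter({'VB': 1})])
#     NN   VB
# 0  2.0  NaN
# 1  NaN  1.0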

And delete the list afterwards to free up the memory:

del out_data
Bill
  • Thank you. It did the job in about 15mins – Sophie May 02 '21 at 03:03
  • Excellent. If you need to do it faster, a parallel-processing solution might be the next step. This would be quite easy to do with [Dask](https://dask.org) I think (a rough sketch follows below). See [this answer](https://stackoverflow.com/a/65498971/1609514) I wrote for a file processing problem. – Bill May 02 '21 at 16:15
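
A minimal sketch of the Dask idea from the comment above (assuming dask is installed; process_text and the partition count are illustrative choices, not part of the original answer):

from collections import Counter

import dask.bag as db
import nltk
import pandas as pd

def process_text(text):
    # the same per-row work as in the loop above
    tokens = nltk.word_tokenize(text)
    pos = nltk.pos_tag(tokens)
    return Counter(tag for _, tag in pos)

# split the texts into partitions and process them in parallel workers
bag = db.from_sequence(input_df['text'], npartitions=8)
counts = bag.map(process_text).compute()

# same post-processing as the sequential version
out_df = pd.DataFrame(counts).fillna(0).astype(int)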

I think you must use a vectorized solution, maybe: "Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed and can be avoided (using) a vectorized solution: many operations can be performed using built-in methods or NumPy functions, (boolean) indexing." From https://towardsdatascience.com/you-dont-always-have-to-loop-through-rows-in-pandas-22a970b347ac
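
As a toy illustration of the quoted advice (a generic sketch, not specific to the POS-tagging case): a built-in string accessor replaces an explicit Python loop over the rows.

import pandas as pd

df = pd.DataFrame({'text': ['a b c', 'd e']})

# loop version:
# lengths = [len(t.split()) for t in df['text']]

# built-in vectorized alternative, no explicit Python loop:
lengths = df['text'].str.split().str.len()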

Tprograma
  • I couldn't find a way to vectorize the function that processes each row, which would probably have been the most efficient way. – Sophie May 02 '21 at 02:59