
I have a pandas dataframe with a text column. Now I want to group the dataframe and concatenate the text values within each group - here's some code to generate an example dataframe:

import numpy as np
import pandas as pd

import string
import random

def text_generator(size=6, chars=string.ascii_lowercase):
    return ''.join(random.choice(chars) for _ in range(size))

items, clusters, texts = [], [], []
for item in range(200):
    for cluster in range(1000):
        for line in range(random.randint(1, 4)):
            items.append(item)
            clusters.append(cluster)
            texts.append(text_generator())
df = pd.DataFrame({'item_id': items, 'cluster_id': clusters, 'text': texts})

Now I group by the columns 'item_id' and 'cluster_id' and create a new dataframe for the aggregated result:

grouped = df.groupby(['item_id', 'cluster_id'])
df_cluster = grouped.size().to_frame('cluster_size')
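On a tiny hand-built frame (made-up values, just to illustrate the grouping step), the result looks like this:

```python
import pandas as pd

# Small hand-built frame just to illustrate the grouping step
df = pd.DataFrame({
    'item_id':    [1, 1, 1, 2],
    'cluster_id': [7, 7, 8, 7],
    'text':       ['a', 'b', 'c', 'd'],
})

grouped = df.groupby(['item_id', 'cluster_id'])
# .size() counts the rows in each (item_id, cluster_id) group
df_cluster = grouped.size().to_frame('cluster_size')
```

Here `(1, 7)` gets a `cluster_size` of 2 and the other two groups get 1 each.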

Maybe I'm wrong, but the obvious solution seems to be to aggregate the text like this:

df_cluster['texts'] = grouped.text.agg(lambda x: ' '.join(x))
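To be clear about what that aggregation produces, here is the same idea on a small frame with made-up values:

```python
import pandas as pd

df = pd.DataFrame({
    'item_id':    [1, 1, 2],
    'cluster_id': [7, 7, 7],
    'text':       ['foo', 'bar', 'baz'],
})

grouped = df.groupby(['item_id', 'cluster_id'])
# Each group's text values are joined into one space-separated string
texts = grouped.text.agg(lambda x: ' '.join(x))
```

So group `(1, 7)` becomes the single string `'foo bar'`.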

But this takes about 10s. For a few megabytes of data? Weird. So I tested a plain Python solution:

text_lookup = {}
for item_id, cluster_id, text in zip(df.item_id.values, df.cluster_id.values, df.text.values):
    text_lookup.setdefault((item_id, cluster_id), []).append(text)
item_ids, cluster_ids, all_texts = [], [], []
for (item_id, cluster_id), texts in text_lookup.items():
    item_ids.append(item_id)
    cluster_ids.append(cluster_id)
    all_texts.append(' '.join([t for t in texts if t is not np.nan]))
df_tags = pd.DataFrame({'item_id': item_ids, 'cluster_id': cluster_ids, 'texts': all_texts}).set_index(['item_id', 'cluster_id'])
df_cluster = df_cluster.merge(df_tags, left_index=True, right_index=True)

This should be much slower, because I'm doing all these for loops in plain Python, but it takes only 3s. I'm probably doing something wrong, but I don't know what :).
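As a sanity check, the two approaches can be verified to produce identical strings per group. This is a sketch on a smaller frame than the one above (timings vary by machine, so only the equivalence is checked here):

```python
import random
import string

import pandas as pd

random.seed(0)

def text_generator(size=6, chars=string.ascii_lowercase):
    return ''.join(random.choice(chars) for _ in range(size))

# Smaller version of the frame above so this runs quickly
items, clusters, texts = [], [], []
for item in range(20):
    for cluster in range(50):
        for _ in range(random.randint(1, 4)):
            items.append(item)
            clusters.append(cluster)
            texts.append(text_generator())
df = pd.DataFrame({'item_id': items, 'cluster_id': clusters, 'text': texts})

# Approach 1: pandas groupby + agg
agg_result = df.groupby(['item_id', 'cluster_id']).text.agg(' '.join).to_dict()

# Approach 2: plain-Python dict of lists
lookup = {}
for i, c, t in zip(df.item_id.values, df.cluster_id.values, df.text.values):
    lookup.setdefault((i, c), []).append(t)
loop_result = {k: ' '.join(v) for k, v in lookup.items()}
```

Both dicts map `(item_id, cluster_id)` pairs to the same joined strings; only the time taken to build them differs.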

ephes
  • Have you tried: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.cat.html ? – michael_j_ward Jun 09 '16 at 16:04
  • @ephes, So, you have a question on which code, exactly? – Merlin Jun 09 '16 at 16:11
  • @Merlin How to do 'df_cluster['texts'] = grouped.text.agg(lambda x: ' '.join(x))' but fast :). – ephes Jun 09 '16 at 16:14
  • @michael_j_ward 'df_cluster['texts'] = grouped.text.str.cat(sep=' ')' raises an AttributeError – ephes Jun 09 '16 at 16:16
  • @ephes I meant like `df_cluster['texts'] = grouped.text.agg(lambda x: x.str.cat(sep=' '))`. But it is still slow. I'm interested to hear the answer to this so great question. – michael_j_ward Jun 09 '16 at 16:18
  • [For loops with pandas - When should I care?](https://stackoverflow.com/questions/54028199/for-loops-with-pandas-when-should-i-care) will probably answer your question. – cs95 Feb 05 '19 at 23:29
  • using `grouped.agg(lambda x: x.values.sum())` I get almost the same computation time as with your loop (~20% slower) – Mrml91 Mar 21 '20 at 12:09

1 Answer


`grouped.text.agg(lambda z: ''.join(z.values))` proved to be ~10% faster than the loop for me.
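For instance (a small sketch with made-up values; note that joining on `''` as above drops the separators, so `' '` is used here to match the question's output):

```python
import pandas as pd

df = pd.DataFrame({
    'key':  ['a', 'a', 'b'],
    'text': ['x', 'y', 'z'],
})

# Joining the underlying numpy array (.values) skips some per-group
# pandas Series overhead inside the lambda
joined = df.groupby('key').text.agg(lambda z: ' '.join(z.values))
```

Group `'a'` yields `'x y'` and group `'b'` yields `'z'`.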

Mrml91