I have a pandas dataframe with a text column. Now I want to group this dataframe and concatenate the text values within each group - here's some code to generate an example dataframe:
import numpy as np
import pandas as pd
import string
import random
def text_generator(size=6, chars=string.ascii_lowercase):
    return ''.join(random.choice(chars) for _ in range(size))
items, clusters, texts = [], [], []
for item in range(200):
    for cluster in range(1000):
        for line in range(random.randint(1, 4)):
            items.append(item)
            clusters.append(cluster)
            texts.append(text_generator())
df = pd.DataFrame({'item_id': items, 'cluster_id': clusters, 'text': texts})
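That's roughly 500,000 rows (200 items x 1000 clusters x 1-4 lines each), which is the "few megabytes" of text I mention below - a quick check:
print(len(df))  # ~500,000 rows: 200 * 1000 * avg. 2.5 lines per cluster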
Now I group by the columns 'item_id' and 'cluster_id' and create a new dataframe for the aggregated result:
grouped = df.groupby(['item_id', 'cluster_id'])
df_cluster = grouped.size().to_frame('cluster_size')
Maybe I'm wrong, but the obvious solution seems to be to aggregate the text like this:
df_cluster['texts'] = grouped.text.agg(lambda x: ' '.join(x))
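(Timed with time.perf_counter around just the aggregation call - a minimal sketch of how I measure it:)
import time

start = time.perf_counter()
joined = grouped.text.agg(lambda x: ' '.join(x))  # the aggregation itself, without the column assignment
print(f'groupby agg: {time.perf_counter() - start:.1f}s')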
But this takes about 10s. For a few megabytes of data? Weird. So I tried a plain Python solution for comparison:
text_lookup = {}
for item_id, cluster_id, text in zip(df.item_id.values, df.cluster_id.values, df.text.values):
    text_lookup.setdefault((item_id, cluster_id), []).append(text)
item_ids, cluster_ids, all_texts = [], [], []
for (item_id, cluster_id), texts in text_lookup.items():
    item_ids.append(item_id)
    cluster_ids.append(cluster_id)
    all_texts.append(' '.join(t for t in texts if pd.notna(t)))  # pd.notna is more robust than 't is not np.nan'
df_tags = pd.DataFrame({'item_id': item_ids, 'cluster_id': cluster_ids, 'texts': all_texts}).set_index(['item_id', 'cluster_id'])
df_cluster = df_cluster.merge(df_tags, left_index=True, right_index=True)
This should be much slower, because I'm doing all these for loops in plain Python, but it takes only about 3s. Probably I'm doing something wrong, but I don't know what :).
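As a sanity check that the two approaches actually agree (a quick sketch; note that ' '.join can also be passed to agg directly instead of wrapping it in a lambda):
a = grouped.text.agg(' '.join).sort_index()  # same aggregation, callable passed directly
b = df_tags['texts'].sort_index()
assert a.index.equals(b.index)  # same (item_id, cluster_id) groups
assert list(a) == list(b)       # same concatenated strings per group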