Python: using groupby and apply to rebulid dataframe

Question

I want to rebulid my dataframe from df1 to df2:

df1 like this:

id	counts	days
1	2	4
1	3	4
1	4	4
2	56	8
2	37	9
2	10	7
2	10	4

df2 like this:

id	countsList	daysList
1	'2,3,4'	'4,4,4'
2	'56,37,10,10'	'8,9,7,4'

where countsList and daysList in df2 is a str.

I have about 1 million lines of df1, it would be very slow if I using for iter.

So I want to using groupby and apply to achieve it. Do you have any solution or efficient way to cover it.

My computer info:

CPU: Xeon 6226R 2.9Ghz 32core
RAM: 16G
python:3.9.7

Does this answer your question? [How to group dataframe rows into list in pandas groupby](https://stackoverflow.com/questions/22219004/how-to-group-dataframe-rows-into-list-in-pandas-groupby) — G. Anderson, Jan 06 '22 at 16:34
This is still going to be very slow, because these string aggregations aren't particularly performant. So opposed to something like calculating a group mean, the runtime for ','.joining strings is going to scale poorly with the number of groups. You'll be using some slow python loop over the groups even if it's disguised by a `.groupby.agg` — ALollz, Jan 06 '22 at 17:07

Daniel Wlazło · Answer 1 · 2022-01-07T00:50:00.170

You might use agg (and then rename columns)

np.random.seed(123)
n = 1_000_000
df = pd.DataFrame({
    "id":  np.random.randint(100_000, size = n),
    "counts": np.random.randint(10, size = n),
    "days": np.random.randint(10, size = n)
})

df2 = df.groupby('id').agg(lambda x: ','.join(map(str, x)))\
         .add_suffix('List').reset_index()

#   id      countsList      daysList
#0  15725   7,5,6,3,7,0     7,9,5,8,0,1
#1  28030   7,6,5,1,9,6,5   5,0,8,4,8,6,0

It isn't "that" slow - %%timeit for 1 milion rows and 100k groups:

639 ms ± 16.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

EDIT: Solution proposed here: How to group dataframe rows into list in pandas groupby is a bit faster:

id, counts, days = df.values[df.values[:, 0].argsort()].T
u_ids, index = np.unique(id, True)
counts = np.split(counts, index[1:])
days = np.split(days, index[1:])
df2 = pd.DataFrame({'id':u_ids, 'counts':counts, 'days':days})

but not mega faster:

313 ms ± 6.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Python: using groupby and apply to rebulid dataframe

1 Answers1