Pandas dataframe join/consolidate into single column as string concatenation, do it cheap

Question

I have a pandas dataframe imported from csv with format as follows:

IP OS port protocol ... 0.0.0.0 Linux 5000 tcp ... 0.0.0.0 Linux 5001 tcp ... 0.0.0.0 Linux 5002 udp ... 0.0.0.1 Windows 4008 tcp ... 0.0.0.1 Windows 4009 udp ... ... ... ... ... ...

I need to consolidate this dataframe in order for IP to be unique, like so:

IP OS port ... 0.0.0.0 Linux 5000(tcp), 5001(tcp), 5002(udp) ... 0.0.0.1 Windows 4008(tcp), 4009(udp) ... ... ... ... ...

In the resulting dataframe, only those three columns matter, I do not care about the rest. Note the string concatenation in port. Also, inplace processing is an option but no requirement as I will further process the data. The original table is quite long though.

Everything I found only merges into several columns, not into a single one. As I want the operation to be cheap and fast (long table) I am wondering if there is an onboard method that I am missing. My ideas form reading the docu, all not quite up to the task as I am not that familiar with pandas:

dataframe.join(): joins into separate columns next to each other, but I need the single one.
dataframe.merge(): does not make the key unique.
Loop over the field and do the string concatenation like if element is in dataframe [...] to kind of copy the ports up to the first occurrence of IP, then do a dataframe.drop_duplicates(['IP'], keep=first, inplace=true) to delete every duplicate of the key except the now consolidated first one. This seems really expensive though.
This answer will not do the job as it joins the strings to fields in a single column, but does not provide an option to concatenate braces into the string or format it in any way. I am explicitly not looking for a field as an output but for a clean string that can be formatted as seen above.

Is there a way to go that I do not know of or a best practice to do this the pythonic way?

Possible duplicate of [Pandas groupby: How to get a union of strings](https://stackoverflow.com/questions/17841149/pandas-groupby-how-to-get-a-union-of-strings) — Georgy, Jun 13 '19 at 08:42
@Georgy unfortunately not. Edited to separate my question from your finding. I am specifically looking for a string solution that I can format, not a field as asked in the suggested duplicate. Totally different approach, would have liked it to be that easy. — harmonica141, Jun 13 '19 at 08:59

jezrael · Accepted Answer · 2019-06-13T10:29:31.193

2

First join columns together with () and then use DataFrame.groupby with join:

df['port'] = df['port'] .astype(str) + '(' + df['protocol'] + ')'

#if possible duplicates
#df = df.drop_duplicates(['P','OS','port'])
df = df.groupby(['IP','OS'])['port'].apply(', '.join).reset_index()
print (df)
        IP       OS                             port
0  0.0.0.0    Linux  5000(tcp), 5001(tcp), 5002(udp)
1  0.0.0.1  Windows             4008(tcp), 4009(udp)

Alternative solution if performance is important:

df = df.groupby(['IP','OS'])['port'].agg(', '.join).reset_index()

edited Jun 13 '19 at 10:29

answered Jun 13 '19 at 07:49

jezrael

822,522
95
1,334
1,252

Solution works but, according to docu, is more expensive than single call to pivot_table(). – harmonica141 Jun 13 '19 at 10:23
1

@harmonica141 - Can you test `df = df.groupby(['P','OS'])['port'].agg(', '.join).reset_index()` ? – jezrael Jun 13 '19 at 10:26
1

Works fine and is even faster, I timed it. This will be the solution, let me edit it in. – harmonica141 Jun 13 '19 at 10:29

score 2 · Answer 2 · edited Jun 13 '19 at 10:23

2

You can use the function pivot_table():

df['port'] = df.port.astype(str) + '(' + df.protocol + ')'
df = df.pivot_table(index=['IP', 'OS'], values ='port', aggfunc=', '.join)

Result:

IP      OS                                  port
0.0.0.0 Linux    5000(tcp), 5001(tcp), 5002(udp)
0.0.0.1 Windows             4008(tcp), 4009(udp)

edited Jun 13 '19 at 10:23

harmonica141

1,389
2
23
27

answered Jun 13 '19 at 08:15

Mykola Zotko

15,583
3
71
73

This is really elegant, but only works half way. The strings are being produced just fine (per line), but the join does not take place in any way. I still end up with separate lines. Checked the docu, can't explain. – harmonica141 Jun 13 '19 at 10:05

Pandas dataframe join/consolidate into single column as string concatenation, do it cheap

2 Answers2