2

I have a pandas dataframe imported from csv with format as follows:

IP OS port protocol ... 0.0.0.0 Linux 5000 tcp ... 0.0.0.0 Linux 5001 tcp ... 0.0.0.0 Linux 5002 udp ... 0.0.0.1 Windows 4008 tcp ... 0.0.0.1 Windows 4009 udp ... ... ... ... ... ...

I need to consolidate this dataframe in order for IP to be unique, like so:

IP OS port ... 0.0.0.0 Linux 5000(tcp), 5001(tcp), 5002(udp) ... 0.0.0.1 Windows 4008(tcp), 4009(udp) ... ... ... ... ...

In the resulting dataframe, only those three columns matter, I do not care about the rest. Note the string concatenation in port. Also, inplace processing is an option but no requirement as I will further process the data. The original table is quite long though.

Everything I found only merges into several columns, not into a single one. As I want the operation to be cheap and fast (long table) I am wondering if there is an onboard method that I am missing. My ideas form reading the docu, all not quite up to the task as I am not that familiar with pandas:

  • dataframe.join(): joins into separate columns next to each other, but I need the single one.
  • dataframe.merge(): does not make the key unique.
  • Loop over the field and do the string concatenation like if element is in dataframe [...] to kind of copy the ports up to the first occurrence of IP, then do a dataframe.drop_duplicates(['IP'], keep=first, inplace=true) to delete every duplicate of the key except the now consolidated first one. This seems really expensive though.
  • This answer will not do the job as it joins the strings to fields in a single column, but does not provide an option to concatenate braces into the string or format it in any way. I am explicitly not looking for a field as an output but for a clean string that can be formatted as seen above.

Is there a way to go that I do not know of or a best practice to do this the pythonic way?

harmonica141
  • 1,389
  • 2
  • 23
  • 27
  • Possible duplicate of [Pandas groupby: How to get a union of strings](https://stackoverflow.com/questions/17841149/pandas-groupby-how-to-get-a-union-of-strings) – Georgy Jun 13 '19 at 08:42
  • @Georgy unfortunately not. Edited to separate my question from your finding. I am specifically looking for a string solution that I can format, not a field as asked in the suggested duplicate. Totally different approach, would have liked it to be that easy. – harmonica141 Jun 13 '19 at 08:59

2 Answers2

2

First join columns together with () and then use DataFrame.groupby with join:

df['port'] = df['port'] .astype(str) + '(' + df['protocol'] + ')'

#if possible duplicates
#df = df.drop_duplicates(['P','OS','port'])
df = df.groupby(['IP','OS'])['port'].apply(', '.join).reset_index()
print (df)
        IP       OS                             port
0  0.0.0.0    Linux  5000(tcp), 5001(tcp), 5002(udp)
1  0.0.0.1  Windows             4008(tcp), 4009(udp)

Alternative solution if performance is important:

df = df.groupby(['IP','OS'])['port'].agg(', '.join).reset_index()
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
2

You can use the function pivot_table():

df['port'] = df.port.astype(str) + '(' + df.protocol + ')'
df = df.pivot_table(index=['IP', 'OS'], values ='port', aggfunc=', '.join)

Result:

IP      OS                                  port
0.0.0.0 Linux    5000(tcp), 5001(tcp), 5002(udp)
0.0.0.1 Windows             4008(tcp), 4009(udp)
harmonica141
  • 1,389
  • 2
  • 23
  • 27
Mykola Zotko
  • 15,583
  • 3
  • 71
  • 73
  • This is really elegant, but only works half way. The strings are being produced just fine (per line), but the join does not take place in any way. I still end up with separate lines. Checked the docu, can't explain. – harmonica141 Jun 13 '19 at 10:05