I have a pandas dataframe imported from csv with format as follows:
IP OS port protocol ...
0.0.0.0 Linux 5000 tcp ...
0.0.0.0 Linux 5001 tcp ...
0.0.0.0 Linux 5002 udp ...
0.0.0.1 Windows 4008 tcp ...
0.0.0.1 Windows 4009 udp ...
... ... ... ... ...
I need to consolidate this dataframe in order for IP
to be unique, like so:
IP OS port ...
0.0.0.0 Linux 5000(tcp), 5001(tcp), 5002(udp) ...
0.0.0.1 Windows 4008(tcp), 4009(udp) ...
... ... ... ...
In the resulting dataframe, only those three columns matter, I do not care about the rest. Note the string concatenation in port
. Also, inplace processing is an option but no requirement as I will further process the data. The original table is quite long though.
Everything I found only merges into several columns, not into a single one. As I want the operation to be cheap and fast (long table) I am wondering if there is an onboard method that I am missing. My ideas form reading the docu, all not quite up to the task as I am not that familiar with pandas:
dataframe.join()
: joins into separate columns next to each other, but I need the single one.dataframe.merge()
: does not make the key unique.- Loop over the field and do the string concatenation like
if element is in dataframe [...]
to kind of copy the ports up to the first occurrence ofIP
, then do adataframe.drop_duplicates(['IP'], keep=first, inplace=true)
to delete every duplicate of the key except the now consolidated first one. This seems really expensive though. - This answer will not do the job as it joins the strings to fields in a single column, but does not provide an option to concatenate braces into the string or format it in any way. I am explicitly not looking for a field as an output but for a clean string that can be formatted as seen above.
Is there a way to go that I do not know of or a best practice to do this the pythonic way?