Combine rows to remove duplicates in CSV Python and Pandas

Question

I'm trying to combine multiple sets of rows together to remove duplicates in a CSV, using python and pandas. Based on a common value, the 'ID', Where there are duplicate rows the values from another column 'HostAffected' should be combined with a line break. Similar to this post: enter link description here however I need to keep all exisitng values that equate to the same ID. I have already done a similar thing with columns using the code below as an example, however its not quite the same:

df = pd.read_csv("output.csv")

cols = ['Host','Protocol','Port']
newcol = ['/'.join(i) for i in zip(df['Host'],df['Protocol'],df['Port'].map(str))]
df = df.assign(HostAffected=newcol).drop(cols, 1)

I have this code so far:

df.groupby(['Plugin ID','Description])[HostAffected'].apply(list)

Adapted from this thread: enter link description here however this does not work.

A example set of data I would have looks like this:

PluginID    Description HostAffected
10395   Windows SMB Shares Enumeration  10.0.0.10/tcp/445
10396   Windows SMB Shares Access   10.0.0.10/tcp/445
10396   Windows SMB Shares Access   192.168.0.12/tcp/445
10398   Windows SMB LsaQueryInformationPolicy   10.0.0.10/tcp/445
10399   SMB Use Domain SID to Enumerate Users   10.0.0.10/tcp/445
10400   Windows SMB Registry Remotely Accessible    10.0.0.10/tcp/445
10736   DCE Services Enumeration    10.0.0.10/tcp/139
10736   DCE Services Enumeration    10.0.0.10/tcp/445
10736   DCE Services Enumeration    192.168.0.12/tcp/445

The values are comma separated however I have used spaces to make it more clear. I want it to look like this, where there is only one unique row for 'Plugin ID' and 'Description' and 'HostAffected' column is combined:

ID  Description HostAffected
10395   Windows SMB Shares Enumeration  10.0.0.10/tcp/445
10396   Windows SMB Shares Access   10.0.0.10/tcp/445
192.168.0.12/tcp/445
10398   Windows SMB LsaQueryInformationPolicy   10.0.0.10/tcp/445
10399   SMB Use Domain SID to Enumerate Users   10.0.0.10/tcp/445
10400   Windows SMB Registry Remotely Accessible    10.0.0.10/tcp/445
10736   DCE Services Enumeration    10.0.0.10/tcp/139
10.0.0.10/tcp/445
192.168.0.12/tcp/445

Essentially there might be the same ID and Description for multiple sets of HostsAffected. Any help would be much appreciated, as this is slightly more complex and challenging than combining columns together.

Do you need `df.groupby(['ID','Description'],as_index=False)['HostAffected'].apply(', '.join)` ? — jezrael, May 14 '18 at 05:45
Or `df.groupby(['ID','Description'],as_index=False)['HostAffected'].apply(list)` ? — jezrael, May 14 '18 at 05:46
Hi I have tried both of these niether work, I don't get an error, it just doesn't combine. I did see the answer from yourself @jezrael 20 in a this thread which I thought would work https://stackoverflow.com/questions/49625148/pandas-group-by-on-groupby-to-list-of-lists however does not. — Abob, May 14 '18 at 06:13
@jezrael the output does not combine the HostAffected column per Plugin_ID and Description. In fact it doesn't affect the output in any way, if i comment the line out I get the same output as when not commented out. — Abob, May 14 '18 at 06:21
I check it and it should working nice, there is no traling whitespaces in column `Description` ? Then `df['Description'] - df['Description'].str.strip()` should help. — jezrael, May 14 '18 at 06:28
This now partial works, however I loose Plugin ID and Description from the CSV — Abob, May 14 '18 at 06:36
What about `f.groupby(['ID','Description'])['HostAffected'].apply(list).reset_index()` ? — jezrael, May 14 '18 at 06:37
Yes that works - is there any way to list the combined values via a line break rather than comma? — Abob, May 14 '18 at 06:41
Theoretiically `df.groupby(['ID','Description'],as_index=False)['HostAffected'].apply('\n'.join)` but not sure if not escaped in pandas. — jezrael, May 14 '18 at 06:43
@jezrael this works - 'df.groupby(['Plugin ID','Issue'],as_index=False)['HostAffected'].apply('\n'.join).reset_index()' thanks for your help! — Abob, May 14 '18 at 06:47

jezrael · Accepted Answer · 2018-05-14T06:54:52.363

0

After comments we get it strip if traling wthitespaces with apply and join by line break:

df['Description'] = df['Description'].str.strip()

(df.groupby(['Plugin ID','Issue'])['HostAffected']
   .apply('\n'.join)
   .reset_index())

edited May 14 '18 at 06:54

answered May 14 '18 at 06:49

jezrael

822,522
95
1,334
1,252

Combine rows to remove duplicates in CSV Python and Pandas

1 Answers1