0

I'm trying to combine multiple sets of rows together to remove duplicates in a CSV, using python and pandas. Based on a common value, the 'ID', Where there are duplicate rows the values from another column 'HostAffected' should be combined with a line break. Similar to this post: enter link description here however I need to keep all exisitng values that equate to the same ID. I have already done a similar thing with columns using the code below as an example, however its not quite the same:

df = pd.read_csv("output.csv")

cols = ['Host','Protocol','Port']
newcol = ['/'.join(i) for i in zip(df['Host'],df['Protocol'],df['Port'].map(str))]
df = df.assign(HostAffected=newcol).drop(cols, 1)

I have this code so far:

df.groupby(['Plugin ID','Description])[HostAffected'].apply(list)

Adapted from this thread: enter link description here however this does not work.

A example set of data I would have looks like this:

PluginID    Description HostAffected
10395   Windows SMB Shares Enumeration  10.0.0.10/tcp/445
10396   Windows SMB Shares Access   10.0.0.10/tcp/445
10396   Windows SMB Shares Access   192.168.0.12/tcp/445
10398   Windows SMB LsaQueryInformationPolicy   10.0.0.10/tcp/445
10399   SMB Use Domain SID to Enumerate Users   10.0.0.10/tcp/445
10400   Windows SMB Registry Remotely Accessible    10.0.0.10/tcp/445
10736   DCE Services Enumeration    10.0.0.10/tcp/139
10736   DCE Services Enumeration    10.0.0.10/tcp/445
10736   DCE Services Enumeration    192.168.0.12/tcp/445

The values are comma separated however I have used spaces to make it more clear. I want it to look like this, where there is only one unique row for 'Plugin ID' and 'Description' and 'HostAffected' column is combined:

ID  Description HostAffected
10395   Windows SMB Shares Enumeration  10.0.0.10/tcp/445
10396   Windows SMB Shares Access   10.0.0.10/tcp/445
192.168.0.12/tcp/445
10398   Windows SMB LsaQueryInformationPolicy   10.0.0.10/tcp/445
10399   SMB Use Domain SID to Enumerate Users   10.0.0.10/tcp/445
10400   Windows SMB Registry Remotely Accessible    10.0.0.10/tcp/445
10736   DCE Services Enumeration    10.0.0.10/tcp/139
10.0.0.10/tcp/445
192.168.0.12/tcp/445

Essentially there might be the same ID and Description for multiple sets of HostsAffected. Any help would be much appreciated, as this is slightly more complex and challenging than combining columns together.

Abob
  • 99
  • 10
  • Do you need `df.groupby(['ID','Description'],as_index=False)['HostAffected'].apply(', '.join)` ? – jezrael May 14 '18 at 05:45
  • Or `df.groupby(['ID','Description'],as_index=False)['HostAffected'].apply(list)` ? – jezrael May 14 '18 at 05:46
  • Hi I have tried both of these niether work, I don't get an error, it just doesn't combine. I did see the answer from yourself @jezrael 20 in a this thread which I thought would work https://stackoverflow.com/questions/49625148/pandas-group-by-on-groupby-to-list-of-lists however does not. – Abob May 14 '18 at 06:13
  • Can you explain more why it does not work? – jezrael May 14 '18 at 06:14
  • @jezrael the output does not combine the HostAffected column per Plugin_ID and Description. In fact it doesn't affect the output in any way, if i comment the line out I get the same output as when not commented out. – Abob May 14 '18 at 06:21
  • I check it and it should working nice, there is no traling whitespaces in column `Description` ? Then `df['Description'] - df['Description'].str.strip()` should help. – jezrael May 14 '18 at 06:28
  • This now partial works, however I loose Plugin ID and Description from the CSV – Abob May 14 '18 at 06:36
  • What about `f.groupby(['ID','Description'])['HostAffected'].apply(list).reset_index()` ? – jezrael May 14 '18 at 06:37
  • Yes that works - is there any way to list the combined values via a line break rather than comma? – Abob May 14 '18 at 06:41
  • Theoretiically `df.groupby(['ID','Description'],as_index=False)['HostAffected'].apply('\n'.join)` but not sure if not escaped in pandas. – jezrael May 14 '18 at 06:43
  • @jezrael this works - 'df.groupby(['Plugin ID','Issue'],as_index=False)['HostAffected'].apply('\n'.join).reset_index()' thanks for your help! – Abob May 14 '18 at 06:47
  • I think `,as_index=False` should be removed. – jezrael May 14 '18 at 06:49

1 Answers1

0

After comments we get it strip if traling wthitespaces with apply and join by line break:

df['Description'] = df['Description'].str.strip()

(df.groupby(['Plugin ID','Issue'])['HostAffected']
   .apply('\n'.join)
   .reset_index())
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252