I'm trying to combine multiple sets of rows together to remove duplicates in a CSV, using python and pandas. Based on a common value, the 'ID', Where there are duplicate rows the values from another column 'HostAffected' should be combined with a line break. Similar to this post: enter link description here however I need to keep all exisitng values that equate to the same ID. I have already done a similar thing with columns using the code below as an example, however its not quite the same:
df = pd.read_csv("output.csv")
cols = ['Host','Protocol','Port']
newcol = ['/'.join(i) for i in zip(df['Host'],df['Protocol'],df['Port'].map(str))]
df = df.assign(HostAffected=newcol).drop(cols, 1)
I have this code so far:
df.groupby(['Plugin ID','Description])[HostAffected'].apply(list)
Adapted from this thread: enter link description here however this does not work.
A example set of data I would have looks like this:
PluginID Description HostAffected
10395 Windows SMB Shares Enumeration 10.0.0.10/tcp/445
10396 Windows SMB Shares Access 10.0.0.10/tcp/445
10396 Windows SMB Shares Access 192.168.0.12/tcp/445
10398 Windows SMB LsaQueryInformationPolicy 10.0.0.10/tcp/445
10399 SMB Use Domain SID to Enumerate Users 10.0.0.10/tcp/445
10400 Windows SMB Registry Remotely Accessible 10.0.0.10/tcp/445
10736 DCE Services Enumeration 10.0.0.10/tcp/139
10736 DCE Services Enumeration 10.0.0.10/tcp/445
10736 DCE Services Enumeration 192.168.0.12/tcp/445
The values are comma separated however I have used spaces to make it more clear. I want it to look like this, where there is only one unique row for 'Plugin ID' and 'Description' and 'HostAffected' column is combined:
ID Description HostAffected
10395 Windows SMB Shares Enumeration 10.0.0.10/tcp/445
10396 Windows SMB Shares Access 10.0.0.10/tcp/445
192.168.0.12/tcp/445
10398 Windows SMB LsaQueryInformationPolicy 10.0.0.10/tcp/445
10399 SMB Use Domain SID to Enumerate Users 10.0.0.10/tcp/445
10400 Windows SMB Registry Remotely Accessible 10.0.0.10/tcp/445
10736 DCE Services Enumeration 10.0.0.10/tcp/139
10.0.0.10/tcp/445
192.168.0.12/tcp/445
Essentially there might be the same ID and Description for multiple sets of HostsAffected. Any help would be much appreciated, as this is slightly more complex and challenging than combining columns together.