1

Hi I'm trying to combine several existing columns into 1 new column then delete the three original ones in a CSV file. I have been trying to do this with pandas however not having much luck. I'm fairly new to python.

My code first combines several CSV files in the same directory an then attempts to manipulates the columns. The first combine works and I get an output.csv with the combined data, however the combine of columns does not.

import glob
import pandas as pd

interesting_files = glob.glob("*.csv")

header_saved = False
with open('output.csv','wb') as fout:
    for filename in interesting_files:
        with open(filename) as fin:
            header = next(fin)
            if not header_saved:
                fout.write(header)
                header_saved = True
            for line in fin:
                fout.write(line)

df = pd.read_csv("output.csv")
df['HostAffected']=df['Host'] + "/" + df['Protocol'] + "/" + df['Port']
df.to_csv("newoutput.csv")

Effectively turning this:

Host,Protocol,Port
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,49707
10.0.0.10,tcp,49672
10.0.0.10,tcp,49670

into something like this:

HostsAffected
10.0.0.10/tcp/445
10.0.0.10/tcp/445
10.0.0.10/tcp/445
10.0.0.10/tcp/445
10.0.0.10/tcp/445
10.0.0.10/tcp/445
10.0.0.11/tcp/445
10.0.0.11/tcp/49707
10.0.0.11/tcp/49672
10.0.0.11/tcp/49670
10.0.0.11/tcp/49668
10.0.0.11/tcp/49667

There are other columns in the csv however.

I'm not a coder, I'm just trying to solve a problem, any help much appreciated.

jpp
  • 159,742
  • 34
  • 281
  • 339
Abob
  • 99
  • 10
  • 3
    Sorry but what is the problem? – Anton vBR May 13 '18 at 13:50
  • 1
    Possible duplicate of [Combine two columns of text in dataframe in pandas/python](https://stackoverflow.com/questions/19377969/combine-two-columns-of-text-in-dataframe-in-pandas-python) – wwii May 13 '18 at 14:09

3 Answers3

2

The way I see it we have three alternatives:

%timeit df['Host'] + "/" + df['Protocol'] + "/" + df['Port'].map(str)
%timeit ['/'.join(i) for i in zip(df['Host'],df['Protocol'],df['Port'].map(str))]
%timeit ['/'.join(i) for i in df[['Host','Protocol','Port']].astype(str).values]

Timings:

10 loops, best of 3: 39.7 ms per loop  
10 loops, best of 3: 35.9 ms per loop  
10 loops, best of 3: 162 ms per loop

However slowest I think this would be your most readable approach:

import pandas as pd

data = '''\
ID,Host,Protocol,Port
1,10.0.0.10,tcp,445
1,10.0.0.10,tcp,445
1,10.0.0.10,tcp,445
1,10.0.0.10,tcp,445
1,10.0.0.10,tcp,445
1,10.0.0.10,tcp,445
1,10.0.0.10,tcp,445
1,10.0.0.10,tcp,49707
1,10.0.0.10,tcp,49672
1,10.0.0.10,tcp,49670'''

df = pd.read_csv(pd.compat.StringIO(data)) # Recreates a sample dataframe

cols = ['Host','Protocol','Port']
newcol = ['/'.join(i) for i in df[cols].astype(str).values]
df = df.assign(HostAffected=newcol).drop(cols, 1)
print(df)

Returns:

   ID         HostAffected
0   1    10.0.0.10/tcp/445
1   1    10.0.0.10/tcp/445
2   1    10.0.0.10/tcp/445
3   1    10.0.0.10/tcp/445
4   1    10.0.0.10/tcp/445
5   1    10.0.0.10/tcp/445
6   1    10.0.0.10/tcp/445
7   1  10.0.0.10/tcp/49707
8   1  10.0.0.10/tcp/49672
9   1  10.0.0.10/tcp/49670
Anton vBR
  • 18,287
  • 5
  • 40
  • 46
  • 1
    @jpp Ok. Now it is confirmed. `zip()` should be the fastest solution. – Anton vBR May 13 '18 at 14:28
  • Yep, I agree. Thanks for updating. Still closer, but now we have good benchmarks. – jpp May 13 '18 at 14:29
  • I have tried this method also and works great. thanks. – Abob May 13 '18 at 15:18
  • @Anton vBR if one of these fields is empty the script borks - is there a way around this? for example if tcp is missing for any given line it borks where as it should just skip and combine the ip and port. – Abob May 13 '18 at 16:58
  • @Abob empty as in `''`? – Anton vBR May 13 '18 at 20:09
  • @Abob What version of the solutions are u using? You could try: `newcol = ['/'.join(i).replace('//','/') for i in df[cols].fillna('').astype(str).values]` for instance. – Anton vBR May 13 '18 at 20:10
  • @Anton vBR if TCP was empty or null, ie not present, then the end value should be similar to 10.0.0.10/445 I'm currently using newcol = ['/'.join(i) for i in zip(df['Host'],df['Protocol'],df['Port'].map(str))] however all values need to be present for it to work. – Abob May 13 '18 at 21:13
  • @Abob Maybe change the join: `'/'.join([y for y in i if y])`. You just need to clean the data. – Anton vBR May 13 '18 at 21:14
1

There are couple of ways you can do this: either use vectorised functions to combine series, or use a lambda function with pd.Series.apply.

Vectorised solution

Don't forget to cast non-numeric types as str.

df['HostAffected'] = df['Host'] + '/' + df['Protocol'] + '/' + df['Port'].map(str)

Performance note: Converting a series of ints to strings - Why is apply much faster than astype?

Apply lambda function

df['HostsAffected'] = df.apply(lambda x: '/'.join(list(map(str, x))), axis=1)

With both solutions, you can simply filter by this column to remove all others:

df = df[['HostsAffected']]

Complete example

from io import StringIO
import pandas as pd

mystr = StringIO("""Host,Protocol,Port
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,49707
10.0.0.10,tcp,49672
10.0.0.10,tcp,49670""")

# replace mystr with 'file.csv'
df = pd.read_csv(mystr)

# combine columns
df['HostsAffected'] = df['Host'] + '/' + df['Protocol'] + '/' + df['Port'].map(str)

# include only new columns
df = df[['HostsAffected']]

Result:

print(df)

         HostsAffected
0    10.0.0.10/tcp/445
1    10.0.0.10/tcp/445
2    10.0.0.10/tcp/445
3    10.0.0.10/tcp/445
4    10.0.0.10/tcp/445
5    10.0.0.10/tcp/445
6    10.0.0.10/tcp/445
7  10.0.0.10/tcp/49707
8  10.0.0.10/tcp/49672
9  10.0.0.10/tcp/49670
jpp
  • 159,742
  • 34
  • 281
  • 339
  • 1
    Thanks for your help on this. Adding the .map(str) to the end of the combine columns fixed the issue! – Abob May 13 '18 at 14:19
0

This is how you can do it:

    dt = """Host,Protocol,Port
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,49707
10.0.0.10,tcp,49672
10.0.0.10,tcp,49670"""

tdf = pd.read_csv(pd.compat.StringIO(dt))
tdf['HostsAffected'] = tdf.apply(lambda x: '{}/{}/{}'.format(x['Host'] , x['Protocol'] , x['Port']), axis=1)
tdf = tdf[['HostsAffected']]
tdf.to_csv(<path-to-save-csv-file>)

This will be the output:

    HostsAffected
0   10.0.0.10/tcp/445
1   10.0.0.10/tcp/445
2   10.0.0.10/tcp/445
3   10.0.0.10/tcp/445
4   10.0.0.10/tcp/445
5   10.0.0.10/tcp/445
6   10.0.0.10/tcp/445
7   10.0.0.10/tcp/49707
8   10.0.0.10/tcp/49672
9   10.0.0.10/tcp/49670

If you are reading the CSV from the file, edit the read_csv line as follows:

tdf = pd.read_csv(<path-to-the-file>)
Moh
  • 1,887
  • 3
  • 18
  • 29