0

I am using a network trace dataset, and have loaded the initial data into a pandas dataframe, which looks like this:

Initial Dataframe

I have created a python dict with common port numbers and applications names like

port_dict = {80: 'http', 20: 'ftp', 21: 'ftp'}

and I want to modify my dataframe by adding additional columns whose names will be the unique values of the ports_dict and if either of sport or dport contains the relevant key, the newly added column should have a value True, False otherwise, like this:

Modified Dataframe

In the above picture, the column https should have True as the sport is 443.

How would I go about accomplishing this?

galath
  • 5,717
  • 10
  • 29
  • 41
maverick93
  • 143
  • 1
  • 4
  • 13
  • Looks like a nice usecase for a multi column apply. Just write a function with your mapping dict. Apply this function to slice of your columns. Finish. http://stackoverflow.com/questions/16353729/pandas-how-to-use-apply-function-to-multiple-columns – PlagTag Jul 27 '15 at 06:28

2 Answers2

2

Try this out. Series.map should be a faster way to look up values from the dictionary. pandas.get_dummies turns a single column of data into columns for each distinct value as 1s / 0s, which I'm converting into a bool, and compare with or (|) to get whether the service was on either port.

service = pd.get_dummies(df['sport'].map(port_dict)).astype(bool) | pd.get_dummies(df['sport'].map(port_dict)).astype(bool)

df[services.columns] = services

In [166]: df.head()
Out[166]: 
   dport  sport    ftp   http
0      1      1  False  False
1     80      2  False  False
2      2     80  False   True
3      3     20   True  False
4      1      1  False  False
chrisb
  • 49,833
  • 8
  • 70
  • 70
1

If I may suggest that you will simply have a service column, then if the sport or dport are in the port_dict keys then the value will be written in the service column:

port_dict = {80: 'http', 20: 'ftp', 21: 'ftp'}

df = pd.DataFrame(data={'sport':[1, 2, 80, 20], 'dport':[1, 80, 2, 3]})

for i in df.index:
    found_service = port_dict.get(df.ix[i, 'sport'], False) or port_dict.get(df.ix[i, 'dport'], False)
    df.at[i, 'service'] = found_service

# a small example dataframe
>>       dport  sport service
      0      1      1    False
      1     80      2    http
      2      2     80    http
      3      3     20     ftp
DeepSpace
  • 78,697
  • 11
  • 109
  • 154
  • Thanks for the suggestion, it works, but because of the looping and since my dataset contains around 47121615 rows, it takes an awful lot of time. Can you suggest me ways to optimise this? – maverick93 Jul 25 '15 at 19:48
  • Try this way instear of the `for` loop. It will create 2 new columns, `sservice` and `dservice`: `df[['dservice', 'sservice']] = df[['dport','sport']].applymap(port_dict.get)` – DeepSpace Jul 25 '15 at 20:32