Create Boolean Columns in Pandas Dataframe using Dictionary

Question

I am using a network trace dataset, and have loaded the initial data into a pandas dataframe, which looks like this:

I have created a python dict with common port numbers and applications names like

port_dict = {80: 'http', 20: 'ftp', 21: 'ftp'}

and I want to modify my dataframe by adding additional columns whose names will be the unique values of the ports_dict and if either of sport or dport contains the relevant key, the newly added column should have a value True, False otherwise, like this:

In the above picture, the column https should have True as the sport is 443.

How would I go about accomplishing this?

Looks like a nice usecase for a multi column apply. Just write a function with your mapping dict. Apply this function to slice of your columns. Finish. http://stackoverflow.com/questions/16353729/pandas-how-to-use-apply-function-to-multiple-columns — PlagTag, Jul 27 '15 at 06:28

score 2 · Accepted Answer · answered Jul 26 '15 at 11:57

Try this out. Series.map should be a faster way to look up values from the dictionary. pandas.get_dummies turns a single column of data into columns for each distinct value as 1s / 0s, which I'm converting into a bool, and compare with or (|) to get whether the service was on either port.

service = pd.get_dummies(df['sport'].map(port_dict)).astype(bool) | pd.get_dummies(df['sport'].map(port_dict)).astype(bool)

df[services.columns] = services

In [166]: df.head()
Out[166]: 
   dport  sport    ftp   http
0      1      1  False  False
1     80      2  False  False
2      2     80  False   True
3      3     20   True  False
4      1      1  False  False

So this way can I deal with the nominal values of the port numbers too?? — maverick93, Jul 28 '15 at 16:59

score 1 · Answer 2 · answered Jul 25 '15 at 19:12

1

If I may suggest that you will simply have a service column, then if the sport or dport are in the port_dict keys then the value will be written in the service column:

port_dict = {80: 'http', 20: 'ftp', 21: 'ftp'}

df = pd.DataFrame(data={'sport':[1, 2, 80, 20], 'dport':[1, 80, 2, 3]})

for i in df.index:
    found_service = port_dict.get(df.ix[i, 'sport'], False) or port_dict.get(df.ix[i, 'dport'], False)
    df.at[i, 'service'] = found_service

# a small example dataframe
>>       dport  sport service
      0      1      1    False
      1     80      2    http
      2      2     80    http
      3      3     20     ftp

answered Jul 25 '15 at 19:12

DeepSpace

78,697
11
109
154

Thanks for the suggestion, it works, but because of the looping and since my dataset contains around 47121615 rows, it takes an awful lot of time. Can you suggest me ways to optimise this? – maverick93 Jul 25 '15 at 19:48
Try this way instear of the `for` loop. It will create 2 new columns, `sservice` and `dservice`: `df[['dservice', 'sservice']] = df[['dport','sport']].applymap(port_dict.get)` – DeepSpace Jul 25 '15 at 20:32

Create Boolean Columns in Pandas Dataframe using Dictionary

2 Answers2