How to create a dummy variable in Pandas Dataframe if a column matches certain values?

Question

I have a Pandas Dataframe with a column (ip) with certain values and another Pandas Series not in this DataFrame with a collection of these values. I want to create a column in the DataFrame that is 1 if a given line has its ipin my Pandas Series (black_ip).

import pandas as pd

dict = {'ip': {0: 103022, 1: 114221, 2: 47902, 3: 23550, 4: 84644}, 'os': {0: 23, 1: 19, 2: 17, 3: 13, 4: 19}}

df = pd.DataFrame(dict)

df
     ip  os
0  103022  23
1  114221  19
2   47902  17
3   23550  13
4   84644  19

blacklist = pd.Series([103022, 23550])

blacklist

0    103022
1     23550

My question is: how can I create a new column in df such that it shows 1 when the given ip in the blacklist and zero otherwise?

Sorry if this too dumb, I'm still new to programming. Thanks a lot in advance!

What is obviously length of `blacklist` values? – jezrael Apr 15 '18 at 17:00 — jezrael, Apr 15 '18 at 17:00
It's MUCH smaller than the DataFrame. – Raul Guarini Riva Apr 15 '18 at 18:54 — Raul Guarini Riva, Apr 15 '18 at 18:54

jezrael · Accepted Answer · 2018-04-15T17:06:43.560

Use isin with astype:

df['new'] = df['ip'].isin(blacklist).astype(np.int8)

Also is possible convert column to categoricals:

df['new'] = pd.Categorical(df['ip'].isin(blacklist).astype(np.int8))

print (df)
       ip  os  new
0  103022  23    1
1  114221  19    0
2   47902  17    0
3   23550  13    1
4   84644  19    0

For interesting in large DataFrame converting to Categorical not save memory:

df = pd.concat([df] * 10000, ignore_index=True)

df['new1'] = pd.Categorical(df['ip'].isin(blacklist).astype(np.int8))
df['new2'] = df['ip'].isin(blacklist).astype(np.int8)
df['new3'] = df['ip'].isin(blacklist)
print (df.memory_usage())
Index        80
ip       400000
os       400000
new1      50096
new2      50000
new3      50000
dtype: int64

Timings:

np.random.seed(4545)

N = 10000
df = pd.DataFrame(np.random.randint(1000,size=N), columns=['ip'])
print (len(df))
10000

blacklist = pd.Series(np.random.randint(500,size=int(N/100)))
print (len(blacklist))
100

In [320]: %timeit df['ip'].isin(blacklist).astype(np.int8)
465 µs ± 21.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [321]: %timeit pd.Categorical(df['ip'].isin(blacklist).astype(np.int8))
915 µs ± 49.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [322]: %timeit pd.Categorical(df['ip'], categories = blacklist.unique()).notnull().astype(int)
1.59 ms ± 20.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [323]: %timeit df['new_column'] = [1 if x in blacklist.values else 0 for x in df.ip]
81.8 ms ± 2.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Thanks for the answer! Is there any diference if I use ```uint8``` instead of int? — Raul Guarini Riva, Apr 15 '18 at 15:57
@RaulGuarini - It is good solution if need save memory and large `DataFrame` :) — jezrael, Apr 15 '18 at 15:58
Yeah, this is really the case. The real dataframe has 200+ MM lines and a couple columns... only categorical data. Doing Machine Learning with it is a real mess in terms of memory consumption! — Raul Guarini Riva, Apr 15 '18 at 16:00

sacuL · Answer 2 · 2018-04-15T17:29:57.303

0

Slow, but simple and readable method:

Another way to do this would be to use create your new column using a list comprehension, set to assign a 1 if your ip value is in blacklist and a 0 otherwise:

df['new_column'] = [1 if x in blacklist.values else 0 for x in df.ip]

>>> df
       ip  os  new_column
0  103022  23           1
1  114221  19           0
2   47902  17           0
3   23550  13           1
4   84644  19           0

EDIT: Faster method building on Categorical: If you want to maximize speed, the following would be quite fast, though not quite as fast as the .isin non-categorical method. It builds on the use of pd.Categorical as suggested by @jezrael, but leveraging it's capacity for assigning categories:

df['new_column'] = pd.Categorical(df['ip'], 
          categories = blacklist.unique()).notnull().astype(int)

Timings:

import numpy as np
import pandas as pd
np.random.seed(4545)
N = 10000
df = pd.DataFrame(np.random.randint(1000,size=N), columns=['ip'])
blacklist = pd.Series(np.random.randint(500,size=int(N/100)))

%timeit df['ip'].isin(blacklist).astype(np.int8)
# 453 µs ± 8.81 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit pd.Categorical(df['ip'].isin(blacklist).astype(np.int8))
# 892 µs ± 17.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit pd.Categorical(df['ip'], categories = \
              blacklist.unique()).notnull().astype(int)
# 565 µs ± 32.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

edited Apr 15 '18 at 17:29

answered Apr 15 '18 at 16:27

sacuL

49,704
8
81
106

How do you test it? It seems in small data or reapeting. And better here is not convert to categoricals, so what return `%timeit (df['ip'].isin(blacklist).astype(np.int8)` ? – jezrael Apr 15 '18 at 17:05
1

Please check my timings, I was surprised with yours, so try create with large data. – jezrael Apr 15 '18 at 17:08
Like I said in my answer, your `isin` method is the fastest :). I re-ran my timings on your larger datasets, I'll update my timings, one sec – sacuL Apr 15 '18 at 17:14
on my machine, I still get a performance boost when I assign the categories when converting to `Categorical`, but it seems like we get different results with respects to that... – sacuL Apr 15 '18 at 17:22
Ya, I am surprised too. I use pandas 0.22.0, python 3.5 under windows. And you? – jezrael Apr 15 '18 at 17:22
so it is difference, hmmm. – jezrael Apr 15 '18 at 17:25
Thanks a lot for showing another route! Actually, I was using something very similar to your first solution @sacul. But it was taking too long on my machine. Thanks once more! – Raul Guarini Riva Apr 15 '18 at 18:58

How to create a dummy variable in Pandas Dataframe if a column matches certain values?

2 Answers2