
I want to convert a number to binary and store the digits in multiple columns of a Pandas DataFrame using Python. Here is an example.

import pandas as pd

df = pd.DataFrame([['a', 1], ['b', 2], ['c', 0]], columns=["Col_A", "Col_B"])

for i in range(0, len(df)):
    # strip the '0b' prefix, pad to 2 digits, split into one character per column
    df.loc[i, 'Col_C'], df.loc[i, 'Col_D'] = list(bin(df.loc[i, 'Col_B'])[2:].zfill(2))

I am trying to convert a number to binary and store it across multiple columns of the dataframe. After conversion, the output has to contain 2 digits. The loop above works fine.

Question: if my dataset contains thousands of records, I can see a performance difference. How can I improve the performance of the above code? I tried the following one-liner, which didn't work for me.

df[['Col_C','Col_D']] = list( (bin(df['Col_B']).zfill(2) ) )
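(For reference, the one-liner fails because `bin()` expects a single Python integer, not a whole Series; it does not apply element-wise. A minimal reproduction of the failure:)

```python
import pandas as pd

df = pd.DataFrame([['a', 1], ['b', 2], ['c', 0]], columns=["Col_A", "Col_B"])
try:
    bin(df['Col_B'])  # bin() needs one integer; a multi-element Series is not one
    failed = False
except TypeError:
    failed = True
print(failed)  # True
```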
Deva K

2 Answers


If performance is important, use NumPy:

import numpy as np

d = df['Col_B'].values
m = 2  # number of output bit columns
# test each value against the bit masks 1 and 2; broadcasting yields an (n, m) 0/1 array
df[['Col_C','Col_D']] = pd.DataFrame(((d[:,None] & (1 << np.arange(m))) > 0).astype(int))
print (df)
  Col_A  Col_B  Col_C  Col_D
0     a      1      1      0
1     b      2      0      1
2     c      0      0      0
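The key step is the broadcast of the column against a vector of bit masks. A minimal sketch of what happens for the sample values:

```python
import numpy as np

d = np.array([1, 2, 0])       # the sample Col_B values
masks = 1 << np.arange(2)     # array([1, 2]): masks for bit 0 and bit 1
# d[:, None] has shape (3, 1); broadcasting against masks (shape (2,)) gives (3, 2)
bits = ((d[:, None] & masks) > 0).astype(int)
print(bits.tolist())  # [[1, 0], [0, 1], [0, 0]]
```

Note the bit order: the first column here is the least significant bit, which is the reverse of the string produced by `bin(x)[2:].zfill(2)`.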

Performance (about 1000 times faster):

df = pd.DataFrame([['a', 1], ['b', 2], ['c', 0]], columns=["Col_A", "Col_B"])

# replicate the sample to 3,000 rows for benchmarking
df = pd.concat([df] * 1000, ignore_index=True)

In [162]: %%timeit
     ...: df[['Col_C','Col_D']] = df['Col_B'].apply(lambda x: pd.Series(list(bin(x)[2:].zfill(2))))
     ...: 
609 ms ± 14.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [163]: %%timeit
     ...: d = df['Col_B'].values
     ...: m = 2
     ...: df[['Col_C','Col_D']]  = pd.DataFrame((((d[:,None] & (1 << np.arange(m)))) > 0).astype(int))
     ...: 
618 µs ± 26.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
jezrael

apply is the method you are looking for.

df[['Col_C','Col_D']] = df['Col_B'].apply(lambda x: pd.Series(list(bin(x)[2:].zfill(2))))

does the trick.

I benchmarked it on 3000 rows, and it is faster than the for-loop method you mention (0.5 seconds vs 3.4 seconds). But generally it won't be much faster than that, since apply still calls the function once per row.

from time import time
start = time()
for i in range(0,len(df)):
    df.loc[i,'Col_C'],df.loc[i,'Col_D'] = list( (bin(df.loc[i,'Col_B'])[2:].zfill(2) ) )
print(time() - start)
# 3.4339962005615234

start = time()
df[['Col_C','Col_D']] = df['Col_B'].apply(lambda x: pd.Series(list(bin(x)[2:].zfill(2))))
print(time() - start)
# 0.5619983673095703

Note: I am using Python 3, where bin(1) returns '0b1', so I use bin(x)[2:] to strip the '0b' prefix.
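A quick check of how the pieces fit together (prefix stripping, padding, and splitting into one character per column):

```python
x = 1
print(bin(x))                     # '0b1'
print(bin(x)[2:].zfill(2))        # '01'
print(list(bin(x)[2:].zfill(2)))  # ['0', '1'] -> one digit per column
```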

Matěj Račinský
  • @jezrael, your solution worked. This is really faster. I processed 50K records: using your solution it took nearly 13s, while Matej's solution took less than 1s. I need to process huge data, so I want to go with performance. – Deva K Feb 11 '19 at 01:02