
I have a dataframe, e.g. like below:

Event['EVENT_ID'] = [4162, 4161, 4160, 4159, 4158, 4157, 4156, 4155, 4154]

I need to convert each row's value to binary.

Event['b'] = bin(Event['EVENT_ID']) doesn't work:
TypeError: cannot convert the series to <class 'int'>

Expected: a new column with the binary representation, with the 0b prefix removed, and the column split into 16 separate columns.

bin(4162) = '0b1000001000010'
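(For reference: bin only accepts a single Python int, not a whole Series, hence the TypeError above. A minimal element-wise sketch, using format instead of bin to avoid the 0b prefix and pad to 16 digits:)

import pandas as pd

Event = pd.DataFrame({'EVENT_ID': [4162, 4161, 4160, 4159, 4158, 4157, 4156, 4155, 4154]})

# format(x, '016b') gives the binary digits without the '0b' prefix, zero-padded to 16 characters
Event['b'] = Event['EVENT_ID'].apply(lambda x: format(x, '016b'))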
Pureti

3 Answers


Edit

Although my idea is there, this implementation is in fact much slower than all the other answers. Please see @ZaeroDivide's answer and comments below.

Original Answer

I don't think using the bin function and working with the str type is particularly efficient. Please consider using bitmasks.

# For each of the 16 bit positions, mask out bit i and store it as 0/1
for i in range(16):
    df[f"bit{i}"] = df["EVENT_ID"].apply(lambda x: x & (1 << i)).astype(bool).astype(int)

Testing with your data, I get the following results:

   EVENT_ID              B  bit0  bit1  bit2  bit3  bit4  bit5  bit6  bit7  \
0      4162  1000001000010     0     1     0     0     0     0     1     0   
1      4161  1000001000001     1     0     0     0     0     0     1     0   
2      4160  1000001000000     0     0     0     0     0     0     1     0   
3      4159  1000000111111     1     1     1     1     1     1     0     0   
4      4158  1000000111110     0     1     1     1     1     1     0     0   
5      4157  1000000111101     1     0     1     1     1     1     0     0   
6      4156  1000000111100     0     0     1     1     1     1     0     0   
7      4155  1000000111011     1     1     0     1     1     1     0     0   
8      4154  1000000111010     0     1     0     1     1     1     0     0   

   bit8  bit9  bit10  bit11  bit12  bit13  bit14  bit15  
0     0     0      0      0      1      0      0      0  
1     0     0      0      0      1      0      0      0  
2     0     0      0      0      1      0      0      0  
3     0     0      0      0      1      0      0      0  
4     0     0      0      0      1      0      0      0  
5     0     0      0      0      1      0      0      0  
6     0     0      0      0      1      0      0      0  
7     0     0      0      0      1      0      0      0  
8     0     0      0      0      1      0      0      0 
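(Side note, not part of the original answer: the same bitmask idea can be vectorized with numpy broadcasting instead of the per-element lambda; a minimal sketch assuming 16-bit values:)

import numpy as np
import pandas as pd

df = pd.DataFrame({"EVENT_ID": [4162, 4161, 4160, 4159, 4158, 4157, 4156, 4155, 4154]})

# Shift each value right by 0..15 and mask the lowest bit: one row per value, one column per bit (bit0 = LSB)
bits = (df["EVENT_ID"].to_numpy()[:, None] >> np.arange(16)) & 1
out = df.join(pd.DataFrame(bits, columns=[f"bit{i}" for i in range(16)], index=df.index))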
Edward Ji
  • **this implementation is one of the most inefficient**: There are two nested loops, the for and the apply/lambda. It is 100x slower than numpy and 3x slower than the other pandas answers. – Zaero Divide May 21 '22 at 23:18
  • My idea was to use a bitwise operation instead of string manipulation, but I am unable to locate the right function. I upvoted your answer. – Edward Ji May 22 '22 at 02:31
  • Usually "low level" operations are available in numpy. Also, operating on the underlying numpy array is _typically_ faster. The problem with your answer is that `for` + `apply lambda` is typically the slowest implementation, even using "efficient" bitwise operations. See that the other pandas answers using strings are actually faster. Strictly speaking, however, you are solving the problem :) – Zaero Divide May 22 '22 at 08:11

You can also work with numpy, which is much faster than pandas.

Edit: faster numpy using view. A couple of tricks here:

  • Work only with the column of interest
  • Convert the underlying array to uint16, to ensure compatibility with any integer input
  • Byteswap to get a proper H, L order (at least on my architecture)
  • Split H, L without actually moving any data, using view
  • Run unpackbits and reshape accordingly

My machine requires a byteswap to get the bytes of the uint16 in the proper place. Note that this approach requires the data to be int16/uint16, while the other one works for int64 as well.

import pandas as pd
import numpy as np

df = pd.DataFrame({'EVENT_ID': [ 4162, 4161, 4160, 4159,4158, 4157, 4156, 4155, 4154]}, dtype='uint16')

# Byteswap so each uint16 is stored high byte first, reinterpret as uint8, unpack to bits, one 16-bit row per value
zz = np.unpackbits(df.EVENT_ID.values.astype('uint16').byteswap().view('uint8')).reshape(-1, 16)
df3 = pd.concat([df, pd.DataFrame(zz)], axis=1)

print(f"{df3 =}")

df3 =   EVENT_ID  0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15
0      4162  0  0  0  1  0  0  0  0  0  1   0   0   0   0   1   0
1      4161  0  0  0  1  0  0  0  0  0  1   0   0   0   0   0   1
2      4160  0  0  0  1  0  0  0  0  0  1   0   0   0   0   0   0
3      4159  0  0  0  1  0  0  0  0  0  0   1   1   1   1   1   1
4      4158  0  0  0  1  0  0  0  0  0  0   1   1   1   1   1   0
5      4157  0  0  0  1  0  0  0  0  0  0   1   1   1   1   0   1
6      4156  0  0  0  1  0  0  0  0  0  0   1   1   1   1   0   0
7      4155  0  0  0  1  0  0  0  0  0  0   1   1   1   0   1   1
8      4154  0  0  0  1  0  0  0  0  0  0   1   1   1   0   1   0

Older proposed method:

# Low byte (mask with 0xFF) and high byte (shift right by 8), each unpacked to 8 bits per value
lh = np.unpackbits((df.values & 0xFF).astype('uint8')).reshape(-1, 8)
uh = np.unpackbits((df.values >> 8).astype('uint8')).reshape(-1, 8)

df2 = pd.concat([df, pd.DataFrame(np.concatenate([uh, lh], axis=1), index=df.index)], axis=1)

Benchmark: numpy is orders of magnitude faster than pandas. For 1M points:

  • numpy view: 35ms for 1 million uint64 points
  • numpy low/high: 50ms
  • pandas list bin: 1.78s
  • pandas apply format + list: 1.97s
  • pandas apply lambda: 6.08s
df = pd.DataFrame({'EVENT_ID': (np.random.random(int(1e6))*65000).astype('uint16')})

pandas apply format list

In [13]: %timeit df2 = df.join(pd.DataFrame(df['EVENT_ID'].apply('{0:b}'.format).apply(list).tolist()))
1.97 s ± 42.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

pandas list bin

In [10]: %%timeit
    ...: binary_values = pd.DataFrame([list(bin(x)[2:]) for x in df['EVENT_ID']])
    ...: df2 = df.join(binary_values)
    ...:
    ...:
1.78 s ± 53.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

pandas apply lambda

In [5]: %%timeit
   ...: for i in range(16):
   ...:     df[f"bit{i}"] = df["EVENT_ID"].apply(lambda x: x & 1 << i).astype(bool).astype(int)
   ...:
6.08 s ± 65.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

numpy

In [14]: %%timeit
    ...: lh = np.unpackbits((df.values & 0xFF).astype('uint8')).reshape(-1,8)
    ...: uh = np.unpackbits((df.values >> 8).astype('uint8')).reshape(-1,8)
    ...: df3=pd.concat([df, pd.DataFrame(np.concatenate([uh,lh],axis=1),index=df.index)],axis=1)
    ...:
    ...:
49.9 ms ± 232 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Zaero Divide

You can try turning the int Series into binary by applying '{0:b}'.format, then splitting the resulting list column into multiple columns with pd.DataFrame:

df = df.join(pd.DataFrame(df['EVENT_ID'].apply('{0:b}'.format).apply(list).tolist()))
print(df)

   EVENT_ID  0  1  2  3  4  5  6  7  8  9 10 11 12
0      4162  1  0  0  0  0  0  1  0  0  0  0  1  0
1      4161  1  0  0  0  0  0  1  0  0  0  0  0  1
2      4160  1  0  0  0  0  0  1  0  0  0  0  0  0
3      4159  1  0  0  0  0  0  0  1  1  1  1  1  1
4      4158  1  0  0  0  0  0  0  1  1  1  1  1  0
5      4157  1  0  0  0  0  0  0  1  1  1  1  0  1
6      4156  1  0  0  0  0  0  0  1  1  1  1  0  0
7      4155  1  0  0  0  0  0  0  1  1  1  0  1  1
8      4154  1  0  0  0  0  0  0  1  1  1  0  1  0
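(Note, my own addition rather than part of the original answer: the question asks for 16 separate columns, while the output above has only as many columns as binary digits. Zero-padding with a fixed-width format such as '{0:016b}' gives a constant 16 columns, a minimal sketch:)

# Zero-pad each value to 16 binary digits so every row splits into the same 16 columns
df16 = df[['EVENT_ID']].join(
    pd.DataFrame(df['EVENT_ID'].apply('{0:016b}'.format).apply(list).tolist(), index=df.index).astype(int)
)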
Ynjxsjmh