
I have some boolean variables in a pandas DataFrame and I need to get all unique tuples of their values. My idea was to create a new column combining the values of my variables, then use pandas.Series.unique() on that column to get all unique tuples.

My plan was to combine them using binary expansion. For instance, for the DataFrame:

import pandas as pd
df = pd.DataFrame({'v1':[0,1,0,0,1],'v2':[0,0,0,1,1], 'v3':[0,1,1,0,1], 'v4':[0,1,1,1,1]})

I could create a column like this:

df['added'] = df['v1'] + df['v2']*2 + df['v3']*4 + df['v4']*8

(For example, the second row (1, 0, 1, 1) encodes to 1 + 0*2 + 1*4 + 1*8 = 13.)

My idea was to iterate over the list of variables like this (note that in my real problem I do not know the number of columns in advance):

variables = ['v1', 'v2', 'v3', 'v4']
df['added'] = df['v1']
for ind, var in enumerate(variables[1:]):
    df['added'] = df['added'] + df[var] << ind

This, however, throws an error: TypeError: unsupported operand type(s) for <<: 'Series' and 'int'. (Note that + binds more tightly than <<, so this line actually computes (df['added'] + df[var]) << ind.)

I can work around this with pandas.Series.apply(), like so:

variables = ['v1', 'v2', 'v3', 'v4']
df['added'] = df['v1']
for ind, var in enumerate(variables[1:]):
    df['added'] = df['added'] + df[var].apply(lambda x: x << ind)

However, apply() is typically slow. How can I do this more efficiently?

Thanks in advance

– Matina G

3 Answers


Getting unique rows is the same operation as drop_duplicates: by finding all duplicate rows and dropping them, it leaves only the unique ones.

df[["v2","v3","v4"]].drop_duplicates()
– Charles Landau

Use a dot product with a vector of powers of two (vectorized bit-packing). It can be written in one line here because the column order already matches the bit order:

import numpy as np

df['new'] = df.values.dot(1 << np.arange(df.shape[-1]))
print (df)
   v1  v2  v3  v4  new
0   0   0   0   0    0
1   1   0   1   1   13
2   0   0   1   1   12
3   0   1   0   1   10
4   1   1   1   1   15
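
Once the packed column exists, the unique codes (and therefore the unique tuples) follow directly. A minimal sketch of both routes, assuming NumPy >= 1.13 for the axis argument of np.unique:

import numpy as np

# unique packed codes, one per distinct row
codes = df['new'].unique()

# or skip the packing entirely and get the unique rows themselves
unique_rows = np.unique(df[['v1', 'v2', 'v3', 'v4']].values, axis=0)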

Performance with 1000 rows and 4 columns:

np.random.seed(2019)

N= 1000
df = pd.DataFrame(np.random.choice([0,1], size=(N, 4)))
df.columns = [f'v{x+1}' for x in df.columns]

In [60]: %%timeit
    ...: df['new'] = df.values.dot(1 << np.arange(df.shape[-1]))
113 µs ± 1.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Yuca's solution:

In [65]: %%timeit
    ...: variables = ['v1', 'v2', 'v3', 'v4']
    ...: df['added'] = df['v1']
    ...: for ind, var in enumerate(variables[1:]) :
    ...:     df['added'] = df['added'] + [x<<ind for x in df[var]]
    ...: 
1.82 ms ± 16.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Original solution:

In [66]: %%timeit
    ...: variables = ['v1', 'v2', 'v3', 'v4']
    ...: df['added'] = df['v1']
    ...: for ind, var in enumerate(variables[1:]) :
    ...:    df['added'] = df['added'] + df[var].apply(lambda x : x << ind )
    ...: 
3.14 ms ± 8.52 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
– jezrael
  • +1 for your solution. However, it either consumes memory (creating the array a) or repeats the df.values operation twice, so I am not sure this is the best that can be done... – Matina G Apr 02 '19 at 13:30
  • @MatinaG - I don't understand why twice? In my opinion this should be the most optimal solution, because it works with all the data at once instead of looping over each column separately – jezrael Apr 02 '19 at 13:32
  • You either store df.values in a separate variable (a) or, if you don't, you have to do something like df['new'] = df.values.dot(1 << np.arange(df.values.shape[-1])). Thanks for the suggestion – Matina G Apr 02 '19 at 13:47
  • @MatinaG - I found a better solution that avoids the double df.values (it uses df.shape instead), and I added timings to compare performance. – jezrael Apr 03 '19 at 06:24

To answer your question about a more efficient alternative: a list comprehension does help a bit:

variables = ['v1', 'v2', 'v3', 'v4']
df['added'] = df['v1']
for ind, var in enumerate(variables[1:]):
    %timeit df['added'] = df['added'] + [x<<ind for x in df[var]]

308 µs ± 22.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
322 µs ± 19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
316 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So roughly 315 µs per loop, versus:

variables = ['v1', 'v2', 'v3', 'v4']
df['added'] = df['v1']
for ind, var in enumerate(variables[1:]):
    %timeit df['added'] = df['added'] + df[var].apply(lambda x : x << ind )

500 µs ± 38.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
503 µs ± 32.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
481 µs ± 32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

As a disclaimer, I don't agree with the value of the sum: ind starts at 0 here, so v2 is shifted by 0 rather than 1, and the result doesn't match the v1 + v2*2 + v3*4 + v4*8 encoding from the question. But that's a different topic :) (see the corrected sketch below).
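
If the goal is to reproduce the question's encoding exactly, starting the shift count at 1 fixes the sum. A minimal sketch (note that the start=1 adjustment is an addition here, not part of the timed code above):

variables = ['v1', 'v2', 'v3', 'v4']
df['added'] = df['v1']
# start=1 so v2 is shifted by 1, v3 by 2, v4 by 3,
# matching v1 + v2*2 + v3*4 + v4*8 from the question
for ind, var in enumerate(variables[1:], start=1):
    df['added'] = df['added'] + [x << ind for x in df[var]]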

– Yuca