
I have some boolean variables in a pandas DataFrame and I need to get all unique tuples of their values. My idea was to create a new column combining the values of my variables, then use pandas.Series.unique() on that column to get all unique tuples.

My plan was to combine them using binary expansion. For instance, for the DataFrame:

import pandas as pd
df = pd.DataFrame({'v1':[0,1,0,0,1],'v2':[0,0,0,1,1], 'v3':[0,1,1,0,1], 'v4':[0,1,1,1,1]})

I could create a column like this:

df['added'] = df['v1'] + df['v2']*2 + df['v3']*4 + df['v4']*8

(For example, the second row (1, 0, 1, 1) encodes to 1 + 0*2 + 1*4 + 1*8 = 13.)

My idea was to iterate over the list of variables like this (note that in my real problem I do not know the number of columns in advance):

variables = ['v1', 'v2', 'v3', 'v4']
df['added'] = df['v1']
for ind, var in enumerate(variables[1:]):
    df['added'] = df['added'] + df[var] << ind

This, however, throws an error: TypeError: unsupported operand type(s) for <<: 'Series' and 'int'. (Note that + binds more tightly than <<, so this line actually computes (df['added'] + df[var]) << ind.)

I can work around this with pandas.Series.apply(), like so:

variables = ['v1', 'v2', 'v3', 'v4']
df['added'] = df['v1']
for ind, var in enumerate(variables[1:]):
    df['added'] = df['added'] + df[var].apply(lambda x: x << ind)

However, apply() is typically slow. How can I do this more efficiently?

Thanks in advance

– Matina G

3 Answers


Getting unique rows is the same operation as drop_duplicates: by finding all duplicate rows and dropping them, it leaves only the unique ones.

df[["v2","v3","v4"]].drop_duplicates()
– Charles Landau

Use a dot product with a vector of powers of two (vectorized bit-packing). It can be written in one line here because the column order already matches the bit order:

import numpy as np

df['new'] = df.values.dot(1 << np.arange(df.shape[-1]))
print (df)
   v1  v2  v3  v4  new
0   0   0   0   0    0
1   1   0   1   1   13
2   0   0   1   1   12
3   0   1   0   1   10
4   1   1   1   1   15
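
Once the packed column exists, the unique codes (and therefore the unique tuples) follow directly. A minimal sketch of both routes, assuming NumPy >= 1.13 for the axis argument of np.unique:

import numpy as np

# unique packed codes, one per distinct row
codes = df['new'].unique()

# or skip the packing entirely and get the unique rows themselves
unique_rows = np.unique(df[['v1', 'v2', 'v3', 'v4']].values, axis=0)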

Performance with 1000 rows and 4 columns:

np.random.seed(2019)

N= 1000
df = pd.DataFrame(np.random.choice([0,1], size=(N, 4)))
df.columns = [f'v{x+1}' for x in df.columns]

In [60]: %%timeit
    ...: df['new'] = df.values.dot(1 << np.arange(df.shape[-1]))
113 µs ± 1.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Yuca's solution:

In [65]: %%timeit
    ...: variables = ['v1', 'v2', 'v3', 'v4']
    ...: df['added'] = df['v1']
    ...: for ind, var in enumerate(variables[1:]) :
    ...:     df['added'] = df['added'] + [x<<ind for x in df[var]]
    ...: 
1.82 ms ± 16.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Original solution:

In [66]: %%timeit
    ...: variables = ['v1', 'v2', 'v3', 'v4']
    ...: df['added'] = df['v1']
    ...: for ind, var in enumerate(variables[1:]) :
    ...:    df['added'] = df['added'] + df[var].apply(lambda x : x << ind )
    ...: 
3.14 ms ± 8.52 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
– jezrael
  • +1 for your solution. However, it either consumes memory (creating the array a) or repeats the df.values operation twice, so I am not sure this is the best that can be done... – Matina G Apr 02 '19 at 13:30
  • @MatinaG - I don't understand why twice? In my opinion this should be the most optimal solution, because it works with all the data at once instead of looping over each column separately – jezrael Apr 02 '19 at 13:32
  • You either store df.values in a separate variable (a) or, if you don't, you have to do something like df['new'] = df.values.dot(1 << np.arange(df.values.shape[-1])). Thanks for the suggestion – Matina G Apr 02 '19 at 13:47
  • @MatinaG - I found a better solution that avoids the double df.values (it uses df.shape instead), and I added timings to compare performance. – jezrael Apr 03 '19 at 06:24

To answer your question about a more efficient alternative: a list comprehension does help a bit:

variables = ['v1', 'v2', 'v3', 'v4']
df['added'] = df['v1']
for ind, var in enumerate(variables[1:]):
    %timeit df['added'] = df['added'] + [x<<ind for x in df[var]]

308 µs ± 22.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
322 µs ± 19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
316 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So roughly 315 µs per loop, versus:

variables = ['v1', 'v2', 'v3', 'v4']
df['added'] = df['v1']
for ind, var in enumerate(variables[1:]):
    %timeit df['added'] = df['added'] + df[var].apply(lambda x : x << ind )

500 µs ± 38.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
503 µs ± 32.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
481 µs ± 32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

As a disclaimer, I don't agree with the value of the sum: ind starts at 0 here, so v2 is shifted by 0 rather than 1, and the result doesn't match the v1 + v2*2 + v3*4 + v4*8 encoding from the question. But that's a different topic :) (see the corrected sketch below).
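
If the goal is to reproduce the question's encoding exactly, starting the shift count at 1 fixes the sum. A minimal sketch (note that the start=1 adjustment is an addition here, not part of the timed code above):

variables = ['v1', 'v2', 'v3', 'v4']
df['added'] = df['v1']
# start=1 so v2 is shifted by 1, v3 by 2, v4 by 3,
# matching v1 + v2*2 + v3*4 + v4*8 from the question
for ind, var in enumerate(variables[1:], start=1):
    df['added'] = df['added'] + [x << ind for x in df[var]]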

– Yuca