
I have this dataframe:

In [6]: import pandas as pd                                                                                                            

In [7]: import numpy as np                                                                                                             

In [8]: df = pd.DataFrame(data = np.nan, 
   ...:                   columns = ['A', 'B', 'C', 'D', 'E'], 
   ...:                   index = ['A', 'B', 'C', 'D', 'E']) 
   ...:                  
   ...: df['list_of_codes'] = [['A' , 'B'], 
   ...:                        ['A', 'B', 'E'], 
   ...:                        ['C', 'D'], 
   ...:                        ['B', 'D'], 
   ...:                        ['E']] 
   ...:  
   ...: df                                                                                                                             
Out[8]: 
    A   B   C   D   E list_of_codes
A NaN NaN NaN NaN NaN        [A, B]
B NaN NaN NaN NaN NaN     [A, B, E]
C NaN NaN NaN NaN NaN        [C, D]
D NaN NaN NaN NaN NaN        [B, D]
E NaN NaN NaN NaN NaN           [E]

And now I want to insert a '1' where both the index and column name are present inside of the list in the column df['list_of_codes']. The result would look like this:

    A   B   C   D   E list_of_codes
A   1   1   0   0   0        [A, B]
B   1   1   0   0   1     [A, B, E]
C   0   0   1   1   0        [C, D]
D   0   1   0   1   0        [B, D]
E   0   0   0   0   1           [E]

I have tried something like this:

df.apply(lambda x: 1 if x[:-1] in (x[-1]) else 0, axis=1, result_type='broadcast')

but get the error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
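(As context for this error: `x[:-1] in (x[-1])` asks whether the Series `x[:-1]` is contained in the list `x[-1]`; the `in` check compares each list element to the whole Series with `==`, which produces a boolean Series element-wise, and Python then tries to collapse that Series into a single True/False. A minimal sketch of the same failure:)

```python
import pandas as pd

s = pd.Series([0.0, 1.0])

# 'A' == s is evaluated element-wise and returns a boolean Series;
# the `in` membership check then needs a single bool from it, which raises.
try:
    s in ['A', 'B']
except ValueError as e:
    print(e)  # The truth value of a Series is ambiguous...
```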

I don't fully understand this error, but then I try:

df.apply(lambda x: 1 if x[:-1].any() in (x[-1]) else 0, axis=1, result_type='broadcast')

This runs but does not give me the desired result. Instead it returns:

    A   B   C   D   E list_of_codes
A   0   0   0   0   0             0
B   0   0   0   0   0             0
C   0   0   0   0   0             0
D   0   0   0   0   0             0
E   0   0   0   0   0             0

Can someone help me understand what I need in my pd.apply() and lambda functions in order to broadcast the '1's in the way that I am trying to? Thanks in advance!

Bo Reppen

2 Answers


IIUC, use Series.explode and then Series.str.get_dummies to build the indicator columns. Finally, we can use groupby.max on the index to collapse back to one row per label and assign to the original dataframe:

df = df.assign(**df['list_of_codes'].explode()
                            .str.get_dummies()
                            .groupby(level=0).max())
print(df)

Output

   A  B  C  D  E list_of_codes
A  1  1  0  0  0        [A, B]
B  1  1  0  0  1     [A, B, E]
C  0  0  1  1  0        [C, D]
D  0  1  0  1  0        [B, D]
E  0  0  0  0  1           [E]
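To make the pipeline concrete, here is a step-by-step sketch on the sample data (the frame is rebuilt so the snippet runs standalone; requires pandas ≥ 0.25 for Series.explode):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.nan, columns=list('ABCDE'), index=list('ABCDE'))
df['list_of_codes'] = [['A', 'B'], ['A', 'B', 'E'], ['C', 'D'], ['B', 'D'], ['E']]

# Step 1: explode -> one row per code, original index labels repeated
exploded = df['list_of_codes'].explode()

# Step 2: get_dummies -> one 0/1 indicator column per distinct code
dummies = exploded.str.get_dummies()

# Step 3: collapse the repeated index labels back to one row each
result = dummies.groupby(level=0).max()

df = df.assign(**result)
print(df)
```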

Alternative without explode

df = df.assign(**pd.DataFrame(df['list_of_codes'].tolist(),
                               index = df.index).stack()
                                                .str.get_dummies()
                                                .groupby(level=0)
                                                .max())

EDIT

I think explode is somewhat faster, since the alternative I propose above first creates a dataframe and then uses stack. If explode is not available, we can rely on this post: SO explode. We can also use the level argument of sum instead of an explicit groupby. Let's try some other methods and see which performs best.

s = df['list_of_codes']
index = df.index
df[index] = pd.get_dummies(pd.Series(data = np.concatenate(s.values),
                                     index = index.repeat(s.str.len()))).sum(level=0)

Another approach with pd.Index.isin:

index=df.index
df[index] = [index.isin(l).astype(int) for l in df['list_of_codes']]

I think this could be the fastest.

We could also write only True/False instead of 1/0, which would be faster still:

index=df.index
df[index] = [index.isin(l) for l in df['list_of_codes']]
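A self-contained sketch of the isin version on the question's sample frame (data rebuilt here so it runs on its own; it needs neither explode nor get_dummies, so it also works on older pandas):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.nan, columns=list('ABCDE'), index=list('ABCDE'))
df['list_of_codes'] = [['A', 'B'], ['A', 'B', 'E'], ['C', 'D'], ['B', 'D'], ['E']]

index = df.index
# Each list yields one boolean row aligned with the columns A..E;
# astype(int) turns True/False into 1/0.
df[index] = [index.isin(l).astype(int) for l in df['list_of_codes']]
print(df)
```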
ansev
  • Thanks for your answer! Unfortunately, I only have pandas 0.23 available to me in the environment that I am using, and therefore do not have Series.explode(). The alternative does work but can take a long time on a large dataset. I should have mentioned that the shape of the full unsampled dataframe is ~(15000, 15000). Will Series.explode() (if made available to me) result in a faster run time, or does it have comparable efficiency to the alternative solution? Any thoughts/advice would be highly appreciated! Thanks! – Bo Reppen Mar 17 '20 at 14:54

I cannot make a comment ("less than 50 reputation"), but I did test ansev's solution with a 15000×15000 df. Here is the way I built a test df:

import numpy as np
import pandas as pd

nelem = 15000
elements = range(nelem)

# random list length for each row
sizes = np.random.randint(low=1, high=len(elements), size=nelem)
list_of_codes = []
for i in range(nelem):
    list_of_codes.append(np.random.choice(elements, size=sizes[i]))
df = pd.DataFrame(data={"list_of_codes": list_of_codes})
for col in elements:
    df[col] = np.nan

I tested it on Colab and it gave this outcome:

index = df.index
%timeit df[index] = [index.isin(l) for l in df['list_of_codes']]

The slowest run took 26.21 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 3.04 s per loop

So ansev's solution does work in your case (about 3 s for the full-size frame).

Sun