
I have a pandas DataFrame that looks like the first table below, and I'm trying to turn each row into per-value counts, in the format of the second table:

   COL1    COL2    COL3
    C1      None    None
    C1      C2      None
    C1      C1      None
    C1      C2      C3

to

    C1  C2  C3
     1   0   0
     1   1   0
     2   0   0
     1   1   1

I found this answer: Count occurrences of items in Series in each row of a DataFrame.

This does what I need, but it's extremely slow on my dataset (4 million rows). Even with 400,000 rows it doesn't run at an acceptable speed.

I believe the problem is the list lookup over the massive dataset, combined with apply being essentially a for loop: it's the huge number of loop iterations that's slowing the process down. The potential solutions there essentially say not to use apply, which would make this tedious, given that I have C1, C2, ..., C100 in my list and 10 columns to count against. Any tips on what I might try to improve performance?

Here's the code:

df_store = df.apply(pd.Series.value_counts, axis=1)[list_lookup].fillna(0)
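
For reference, here is a minimal reconstruction of the sample frame above (assuming the None entries are the literal string 'None', which is how the answers below treat them):

import pandas as pd

# Hypothetical reconstruction of the example data
df = pd.DataFrame({
    'COL1': ['C1', 'C1', 'C1', 'C1'],
    'COL2': ['None', 'C2', 'C1', 'C2'],
    'COL3': ['None', 'None', 'None', 'C3'],
})
list_lookup = ['C1', 'C2', 'C3']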

2 Answers


That is why we usually do not use apply:

df.stack().str.get_dummies().groupby(level=0).sum().drop(columns='None')
Out[157]: 
   C1  C2  C3
0   1   0   0
1   1   1   0
2   2   0   0
3   1   1   1
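
A related vectorized option, shown here as an aside rather than part of the original answer, is pd.crosstab on the stacked Series (the .drop assumes the placeholder column is the literal string 'None'):

s = df.stack()
pd.crosstab(s.index.get_level_values(0), s).drop(columns='None')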

Or do it with Counter:

from collections import Counter

# One Counter per row; values absent from a row become NaN in the frame
pd.DataFrame([Counter(x) for x in df.values]).drop(columns='None')
Out[170]: 
   C1   C2   C3
0   1  NaN  NaN
1   1  1.0  NaN
2   2  NaN  NaN
3   1  1.0  1.0
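
To match the desired all-integer output, the NaNs can be filled afterwards (a small addition, not part of the original answer):

pd.DataFrame([Counter(x) for x in df.values]).drop(columns='None').fillna(0).astype(int)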
– BENY
Thank you for responding! The stack method has been running for about 20 minutes on 400k rows. The Counter would still be slow, right? Because it's looping through each row. Wondering if I should convert to individual rows and run it separately before recombining. – Ashwin Shankar Jun 17 '19 at 22:37

If you want a faster solution, you need to use NumPy: one-hot encode with a broadcast comparison and assign the resulting ndarray to a DataFrame. Based on my %timeit, it is about 6x faster than the Counter solution. The disadvantage is that you need to know the values to encode in advance, which in your case are C1, C2, C3, ..., C100.

import numpy as np

encode_val = np.array(['C1', 'C2', 'C3'])
# (n_rows, n_cols, 1) compared against (n_values,) broadcasts to
# (n_rows, n_cols, n_values); summing over axis 1 counts each value per row
encode_arr = (df.values[..., None] == encode_val).astype(int).sum(1)

Out[237]:
array([[1, 0, 0],
       [1, 1, 0],
       [2, 0, 0],
       [1, 1, 1]])

pd.DataFrame(encode_arr, columns=encode_val)

Out[238]:
   C1  C2  C3
0   1   0   0
1   1   1   0
2   2   0   0
3   1   1   1
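
If listing all 100 values by hand is impractical, they can be derived from the data itself (an added sketch, assuming the placeholder is the literal string 'None'):

# Unique values across the whole frame, minus the placeholder
encode_val = np.setdiff1d(pd.unique(df.values.ravel()), ['None'])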
– Andy L.