I have a pandas DF that looks like below and I'm trying to transpose counts of values in different rows to the format below :
COL1 COL2 COL3
C1 None None
C1 C2 None
C1 C1 None
C1 C2 C3
to
C1 C2 C3
1 0 0
1 1 0
2 0 0
1 1 1
I found this answer here (Count occurrences of items in Series in each row of a DataFrame)
This does what I need but it's extremely slow on my dataset (4 Million Rows). I tried to do it with 400,000 rows and it still isn't running at an acceptable speed.
I believe it's the list doing a lookup through the massive dataset and with apply being a essentially a for loop. It's massive loops that's slowing the process. The potential solutions here essentially mention not to use Apply which would make the execution here tedious given that I have C1,C2...C100 in my list and 10 vertical columns to validate counts against. Any tips on what I might be able to try here to improve performance?
Here's the code below :
df_store = df.apply(pd.Series.value_counts, axis=1)[list_lookup].fillna(0)