My apologies SO community, I am a newbie on the platform and in the pursuit of making this question precise and straight to the point, I didn't give relevant info.
My Input Dataframe is:
import pandas as pd
data = {'user_id': ['abc','def','ghi'],
'alpha': ['A','B,C,D,A','B,C,A'],
'beta': ['1|20|30','350','376|98']}
df = pd.DataFrame(data = data, columns = ['user_id','alpha','beta'])
print(df)
Looks like this,
user_id alpha beta
0 abc A 1|20|30
1 def B,C,D,A 350
2 ghi B,C,A 376
I want something like this,
user_id alpha beta a_A a_B a_C a_D b_1 b_20 b_30 b_350 b_376
0 abc A 1|20|30 1 0 0 0 1 1 1 1 0
1 def B,C,D,A 350 1 1 1 1 0 0 0 1 0
2 ghi B,C,A 376 1 1 1 0 0 0 0 0 1
My original data contains 11K rows. And these distinct values in alpha & beta are around 550.
I created a list from all the values in alpha & beta columns and applied pd.get_dummies but it results in a lot of rows like the one displayed by @wwwnde. I would like all the rows to be rolled up based on user_id.
A similar idea is used by CountVectorizer on documents, where it creates columns based on all the words in the sentence and checks the frequency of a word. However, I am guessing Pandas has a better and efficient way to do that.
Grateful for all your assistance. :)