
My apologies, SO community. I am new to the platform, and in trying to keep this question precise and to the point, I left out relevant info.

My Input Dataframe is:

import pandas as pd
data = {'user_id': ['abc', 'def', 'ghi'],
        'alpha': ['A', 'B,C,D,A', 'B,C,A'],
        'beta': ['1|20|30', '350', '376|98']}

df = pd.DataFrame(data = data, columns = ['user_id','alpha','beta'])
print(df)

Looks like this,

  user_id    alpha     beta
0     abc        A  1|20|30
1     def  B,C,D,A      350
2     ghi    B,C,A   376|98

I want something like this,

  user_id    alpha     beta  a_A  a_B  a_C  a_D  b_1  b_20  b_30  b_350  b_376  b_98
0     abc        A  1|20|30    1    0    0    0    1     1     1      0      0     0
1     def  B,C,D,A      350    1    1    1    1    0     0     0      1      0     0
2     ghi    B,C,A   376|98    1    1    1    0    0     0     0      0      1     1

My original data contains around 11K rows, and there are roughly 550 distinct values across alpha and beta.

I created a list from all the values in the alpha & beta columns and applied pd.get_dummies, but it results in a lot of rows, like the output displayed by @wwnde. I would like all the rows to be rolled up based on user_id; a rough sketch of what I tried is below.
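
Roughly what I tried (a sketch continuing from the df defined above; the a_/b_ prefixes are just for readable column names):

# Split the delimited strings into lists, explode to one value per row,
# then one-hot encode. This yields one row per (user_id, value) pair,
# not one row per user_id, which is why it still needs rolling up.
tmp = df.copy()
tmp['alpha'] = tmp['alpha'].str.split(',')
tmp['beta'] = tmp['beta'].str.split('|')
tmp = tmp.explode('alpha').explode('beta')
print(pd.get_dummies(tmp, columns=['alpha', 'beta'], prefix=['a', 'b']))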

CountVectorizer does something similar for documents: it creates a column for every word and counts each word's frequency. However, I am guessing pandas has a cleaner and more efficient way to do this.

Grateful for all your assistance. :)


Sasha18
  • Please, read [How to create a Minimal, Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example). – 89f3a1c Sep 12 '20 at 02:34
  • @Sasha18 see the answer below. You were down-voted because of the way you presented your question. Please don't post images; that is usually why questions get down-voted. You can copy the dataframe, paste it, and format it. If you need further help, let us know. Happy coding. – wwnde Sep 12 '20 at 03:08
  • Always provide a [mre], with **code, data, errors, current output, and expected output, as text**, not screenshots, because [SO Discourages Screenshots](https://meta.stackoverflow.com/questions/303812/). It is likely the question will be down-voted and closed. You are discouraging assistance because no one wants to retype your data or code, and screenshots are often illegible. [edit] the question and **add text**. Please see [How to provide a reproducible copy of your DataFrame using `df.head(30).to_clipboard(sep=',')`](https://stackoverflow.com/questions/52413246) – Trenton McKinney Sep 12 '20 at 04:07

1 Answer


You will have to achieve that in a series of steps.

Sample Data

   id    ALPHA       BETA
0   1        A    1|20|30
1   2  B,C,D,A        350
2   3    B,C,A  395|45|90

Create Lists for values in ALPHA and BETA

df.BETA = df.BETA.apply(lambda x: x.split('|'))
df.ALPHA = df.ALPHA.apply(lambda x: x.split(','))

Explode the list elements into individual rows

df=df.explode('ALPHA')
df=df.explode('BETA')

One-hot encode the values using get_dummies.

df = pd.get_dummies(df)

Strip columns of the prefix

df.columns = df.columns.str.replace('ALPHA_|BETA_', '', regex=True)

   id  A  B  C  D  1  20  30  350  395  45  90
0   1  1  0  0  0  1   0   0    0    0   0   0
0   1  1  0  0  0  0   1   0    0    0   0   0
0   1  1  0  0  0  0   0   1    0    0   0   0
1   2  0  1  0  0  0   0   0    1    0   0   0
1   2  0  0  1  0  0   0   0    1    0   0   0
1   2  0  0  0  1  0   0   0    1    0   0   0
1   2  1  0  0  0  0   0   0    1    0   0   0
2   3  0  1  0  0  0   0   0    0    1   0   0
2   3  0  1  0  0  0   0   0    0    0   1   0
2   3  0  1  0  0  0   0   0    0    0   0   1
2   3  0  0  1  0  0   0   0    0    1   0   0
2   3  0  0  1  0  0   0   0    0    0   1   0
2   3  0  0  1  0  0   0   0    0    0   0   1
2   3  1  0  0  0  0   0   0    0    1   0   0
2   3  1  0  0  0  0   0   0    0    0   1   0
2   3  1  0  0  0  0   0   0    0    0   0   1
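
To roll this back up to one row per id (which is what the question ultimately asks for), one possible follow-up, assuming the dummy-encoded frame from the steps above is stored back in df, is to group the indicator rows and take the maximum:

# Sketch: collapse the exploded indicator rows back to one row per id.
# max() keeps a 1 for any value that appeared in at least one exploded row.
rolled = df.groupby('id', as_index=False).max()
print(rolled)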
wwnde
  • Thanks @wwnde, I have edited my question to follow SO guidelines and for clarity. :) I have achieved the result you presented, but I am stuck at the point where I would like to roll up based on the id column. – Sasha18 Sep 13 '20 at 02:34
  • `cs = (('alpha', ','), ('beta', '|')); df1 = pd.concat([df] + [df[c].str.get_dummies(sep=s) for c, s in cs], axis=1)` This code will solve this problem – Sasha18 Sep 13 '20 at 12:09
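
For reference, the one-liner from the comment above, expanded into a runnable sketch; the a_/b_ prefixes added via add_prefix are an assumption to match the desired output and are not part of the original comment:

import pandas as pd

data = {'user_id': ['abc', 'def', 'ghi'],
        'alpha': ['A', 'B,C,D,A', 'B,C,A'],
        'beta': ['1|20|30', '350', '376|98']}
df = pd.DataFrame(data)

# str.get_dummies splits each cell on the given separator and returns one
# indicator column per distinct value, keeping one row per user_id.
cs = (('alpha', ',', 'a_'), ('beta', '|', 'b_'))
df1 = pd.concat(
    [df] + [df[c].str.get_dummies(sep=s).add_prefix(p) for c, s, p in cs],
    axis=1)
print(df1)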