2

How to get distinct words of a column based on group by of another column

I need to get distinct colB words for each colA value

my dataframe:

colA     colB
US       California City
US       San Jose ABC
UK       London 123
US       California ZZZ
UK       Manchester
UK       London

Reqd dataframe (df):

col A    colB
US       California
US       City
US       ABC
US       ZZZ
US       San
US       Jose
UK       London
UK       123
UK       Manchester

EDIT:

Thanks to @jezrael, I was able to get the desired dataframe

I have another dataframe (df2)

ColC        ColA      ColB
C1          US        California
C1          US        ABC
C2          UK        LONDON

For each value of column (colC), i need the intersection of colB strings with the previously obtained dataframe.

Required:

ColC     n(df2_colBuniq)    n(df_df2_intersec_colB)
C1       2               2
C2       1               1

I tried looping through each unique colC value, but for the large data frame I have, it is taking quite some time. Any suggestions?

msksantosh
  • 379
  • 4
  • 19

2 Answers2

2

Use:


df = (df.set_index('colA')['colB']
        .str.split(expand=True)
        .stack()
        .reset_index(level=1, drop=True)
        .reset_index(name='colB')
        .drop_duplicates()
       )
print (df)
  colA        colB
0   US  California
1   US        City
2   US         San
3   US        Jose
4   US         ABC
5   UK      London
6   UK         123
8   US         ZZZ
9   UK  Manchester
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
1

We can using get_dummies

df.set_index('colA').colB.str.get_dummies(sep=' ').sum(level=0).replace(0,np.nan).stack().reset_index()
Out[13]: 
  colA     level_1    0
0   US         ABC  1.0
1   US  California  2.0
2   US        City  1.0
3   US        Jose  1.0
4   US         San  1.0
5   US         ZZZ  1.0
6   UK         123  1.0
7   UK      London  2.0
8   UK  Manchester  1.0
BENY
  • 317,841
  • 20
  • 164
  • 234