Get distinct words on groupby Pandas dataframe

Question

How can I get the distinct words of one column, grouped by the values of another column?

I need the distinct words in colB for each value of colA.

My dataframe:

colA     colB
US       California City
US       San Jose ABC
UK       London 123
US       California ZZZ
UK       Manchester
UK       London

Required dataframe (df):

colA     colB
US       California
US       City
US       ABC
US       ZZZ
US       San
US       Jose
UK       London
UK       123
UK       Manchester

EDIT:

Thanks to @jezrael, I was able to get the desired dataframe.

I have another dataframe (df2):

ColC        ColA      ColB
C1          US        California
C1          US        ABC
C2          UK        LONDON

For each value of ColC, I need the intersection of its ColB strings with the previously obtained dataframe (df).

Required:

ColC     n(df2_colBuniq)    n(df_df2_intersec_colB)
C1       2                  2
C2       1                  1

I tried looping over each unique ColC value, but on the large dataframe I have this takes a long time. Any suggestions?

Use `df = df.drop_duplicates()` Or `df = df.drop_duplicates(['col A','colB'])` — jezrael, Mar 15 '18 at 12:20
@jezrael: I need distinct words (separated by space) not distinct colB values. I was not very clear about it the first time — msksantosh, Mar 15 '18 at 12:24
@jezrael: I added a follow up under the EDIT: in the question. Any advice? — msksantosh, Mar 15 '18 at 13:27
@msksantosh - I think the best is create new question, can you do it? — jezrael, Mar 15 '18 at 13:28
@jezrael: Sure, but it allows me to post one question every 90 minutes, so i have to wait for some time to post it as another question — msksantosh, Mar 15 '18 at 13:35
@msksantosh - I try understand your new question, but I have problem. Can you explain more? — jezrael, Mar 15 '18 at 13:53
@jezrael: Sure, I have another dataframe with additional column colC, For each unique value of colC, I am trying to get the unique colB values and intersection of colB in the two dataframes — msksantosh, Mar 15 '18 at 13:57
@msksantosh - first column is clear, but in second why is `1` for `C1` ? — jezrael, Mar 15 '18 at 14:39
@msksantosh - Is possible change `df2` for different values in both column? e.g. in first. — jezrael, Mar 15 '18 at 14:52

score 2 · Accepted Answer · answered Mar 15 '18 at 12:27

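Use:

`set_index` to move colA into the index, then select colB
`str.split` with `expand=True` to split on whitespace into a DataFrame
`stack` to reshape the result into a Series
`reset_index` to drop the stacked level and turn the colA index back into a column
`drop_duplicates` to keep each (colA, word) pair once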

df = (df.set_index('colA')['colB']
        .str.split(expand=True)
        .stack()
        .reset_index(level=1, drop=True)
        .reset_index(name='colB')
        .drop_duplicates()
       )
print(df)
  colA        colB
0   US  California
1   US        City
2   US         San
3   US        Jose
4   US         ABC
5   UK      London
6   UK         123
8   US         ZZZ
9   UK  Manchester
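
As a shorter alternative to the split/stack chain above, here is a minimal sketch using `DataFrame.explode`, assuming pandas 0.25 or later. It is not the answerer's method. The name `words` is only illustrative, and the sample frame is rebuilt from the question so the snippet runs on its own.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "colA": ["US", "US", "UK", "US", "UK", "UK"],
    "colB": ["California City", "San Jose ABC", "London 123",
             "California ZZZ", "Manchester", "London"],
})

words = (
    df.assign(colB=df["colB"].str.split())  # each cell becomes a list of words
      .explode("colB")                      # one row per word, colA repeated
      .drop_duplicates()                    # keep the first (colA, word) pair only
      .reset_index(drop=True)               # clean 0..n-1 index
)
print(words)
```

Unlike the version above, this resets the index, so the gap left by the dropped duplicate rows does not show up in the result.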

score 1 · Answer 2 · answered Mar 15 '18 at 12:38

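We can use get_dummies (`sum(level=0)` was removed in pandas 2.0, so the sum goes through `groupby(level=0)`; `sort=False` keeps the groups in first-seen order):

import numpy as np

df.set_index('colA').colB.str.get_dummies(sep=' ').groupby(level=0, sort=False).sum().replace(0,np.nan).stack().reset_index()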
Out[13]: 
  colA     level_1    0
0   US         ABC  1.0
1   US  California  2.0
2   US        City  1.0
3   US        Jose  1.0
4   US         San  1.0
5   US         ZZZ  1.0
6   UK         123  1.0
7   UK      London  2.0
8   UK  Manchester  1.0
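
The follow-up under EDIT in the question (per-ColC counts without a Python loop) is not answered on this page. One possible loop-free sketch uses a single `merge` followed by a `groupby`. It rests on two assumptions that the question does not confirm: matching is case-insensitive (the sample pairs `LONDON` with `London`), and a word only counts as a match when ColA agrees as well. The names `words`, `key`, `hit`, `n_unique` and `n_intersection` are illustrative.

```python
import pandas as pd

# Distinct words per colA, i.e. the "Required dataframe (df)" from the question.
words = pd.DataFrame({
    "colA": ["US"] * 6 + ["UK"] * 3,
    "colB": ["California", "City", "San", "Jose", "ABC", "ZZZ",
             "London", "123", "Manchester"],
})
df2 = pd.DataFrame({
    "ColC": ["C1", "C1", "C2"],
    "ColA": ["US", "US", "UK"],
    "ColB": ["California", "ABC", "LONDON"],
})

# Assumption: case-insensitive matching, so join on a lower-cased key.
left = (df2.assign(key=df2["ColB"].str.lower())
           .drop_duplicates(["ColC", "ColA", "key"]))
right = (words.assign(key=words["colB"].str.lower())
              .rename(columns={"colA": "ColA"})[["ColA", "key"]]
              .drop_duplicates())

# A left merge with indicator=True marks which df2 rows found a partner in `words`.
merged = left.merge(right, on=["ColA", "key"], how="left", indicator=True)

result = (
    merged.assign(hit=merged["_merge"].eq("both"))
          .groupby("ColC")
          .agg(n_unique=("key", "nunique"),     # distinct ColB values per ColC
               n_intersection=("hit", "sum"))   # how many of them appear in `words`
          .reset_index()
)
print(result)
```

On the sample data this should give 2 and 2 for C1 and 1 and 1 for C2, which matches the table in the question. If the comparison is meant to be case-sensitive, or to ignore ColA, the key construction and the `on=` columns would need to change accordingly.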
