3

I have a data set that looks like this-

ID        Search    Provider
1           Yes     A
1           Yes     B
1           No      B
1           No      C
2           Yes     D
2           Yes     A
2           Yes     B
2           No      B
2           No      C

What I want to find out is if the Providers for Search==Yes is different from Search==No for a given ID. E.g.- 'ID 1', 'Search=No' goes to Providers B,C whereas 'Search=Yes' goes to Providers A,B. So the provider A is new for ID 1.

I know I can use the isin function to identify the unique values between two lists. However, how do I do it across multiple rows of ID and Search? And how do I compile the Provider values into lists for each subgroup defined by ID and Search? I guess I will need to use nested loops but not being able to write the code. Would really appreciate if someone can help me on this.

busybear
  • 10,194
  • 1
  • 25
  • 42
  • 1
    Not 100% sure of your desired output. But seems like you want to split then you can `outer` `merge` with an indicator to figure out which providers exist in both, only Yes or only No. See https://stackoverflow.com/questions/53645882/pandas-merging-101 – ALollz Oct 22 '19 at 22:03

3 Answers3

1

Rather than compiling into lists, you might want to consider sets. In a generic sense, this might be more useful as I would assume order and redundancy doesn't matter. Also, it's easier to determine which providers are in one set and not another. You can rearrange your dataframe with pivot_table to do this:

df_new = df.pivot_table(index='ID', columns='Search', aggfunc=set).droplevel(0, axis=1)

Result:

Search      No        Yes
ID                       
1       {C, B}     {A, B}
2       {C, B}  {D, A, B}

With this new dataframe, you can compare values with the same 'ID' easiliy:

# df_new['No'] == df_new['Yes']   # If providers are the same between "yes" and "no"
df_new['Yes'] - df_new['No']      # Providers that are in "yes" but not "no"

Result (for set difference):

ID
1       {A}
2    {D, A}
dtype: object
busybear
  • 10,194
  • 1
  • 25
  • 42
0

This can be done in a few steps. First, group by ID and search, then get the unique values with value_counts.

>>> df1 = df.groupby(['ID', 'Search']).Provider.value_counts()
ID  Search  Provider
1   No      B           1
            C           1
    Yes     A           1
            B           1
2   No      B           1
            C           1
    Yes     A           1
            B           1
            D           1

For each ID/Provider combination, you can then get a count of the number of Yes/No Searches

>>> df2 = df1.unstack(level='Search', fill_value=0)
Search       No  Yes
ID Provider         
1  A          0    1
   B          1    1
   C          1    0
2  A          0    1
   B          1    1
   C          1    0
   D          0    1

From here, you can get the list of Provider/ID combos that have either Yes or No but not both

>>> df2 = df1.query('Yes != No')
Search       No  Yes
ID Provider         
1  A          0    1
   C          1    0
2  A          0    1
   C          1    0
   D          0    1
David Nehme
  • 21,379
  • 8
  • 78
  • 117
0

Method 1

You can use groupby.agg(set) first, then again groupby.diff:

dfg = df.groupby(['ID', 'Search']).agg(set).reset_index()
dfg.groupby('ID')['Provider'].diff().dropna()

1       {A}
3    {A, D}
Name: Provider, dtype: object`

Method 2

Splitting the dataset up in yes and no then groupby.set:

yes = df.loc[df['Search'] == 'Yes']
no  = df.loc[df['Search'] == 'No']

yes_agg = yes.groupby('ID')['Provider'].agg(set)
no_agg = no.groupby('ID')['Provider'].agg(set)

# get the difference between the sets
yes_agg - no_agg

ID
1       {A}
2    {A, D}
Name: Provider, dtype: object
Erfan
  • 40,971
  • 8
  • 66
  • 78