
I have a dataframe as follows:

name  teamA   teamB
foo    a        b
foo    b        c
foo    c        b
bar    a        e
bar    a        d
...

I want to find the intersection of the rows for each name separately, across both columns teamA and teamB, and then clear the cell that contains that intersection value. In this example, the intersection for name "foo" would be "b", and for "bar" it would be "a" (the value that appears in every row of the group, in either column). So the data frame after removing these intersection values would look like this (a small code sketch of the rule follows the table):

name  teamA   teamB
foo     a      " "
foo    " "      c
foo     c      " "
bar    " "      e
bar    " "      d
...
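
To make the rule concrete, here is a minimal sketch of the per-name intersection I have in mind, using plain Python sets over a grouped frame (the frame is just the sample above):

import pandas as pd

df = pd.DataFrame({"name": ["foo", "foo", "foo", "bar", "bar"],
                   "teamA": ["a", "b", "c", "a", "a"],
                   "teamB": ["b", "c", "b", "e", "d"]})

for name, group in df.groupby("name", sort=False):
    # intersect the {teamA, teamB} pair of every row in the group
    rows = (set(pair) for pair in zip(group["teamA"], group["teamB"]))
    print(name, set.intersection(*rows))
# foo {'b'}
# bar {'a'}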

Recently, I've tried combining teamA and teamB into a single column named, for example, teams:

name   teams
foo    [a, b]
foo    [b, c]
foo    [c, b]
...

after which I would like to get:

name   teams
foo    [a, " "]
foo    [" ", c]
foo    [c, " "]
...
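
For reference, that combined teams column is a one-liner over the two existing columns (values.tolist is a standard pandas idiom, nothing specific to my data):

df["teams"] = df[["teamA", "teamB"]].values.tolist()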

But I've found that keeping them as two separate columns is generally recommended, and I found an interesting answer, though I don't know how to apply it to a grouped data frame: https://stackoverflow.com/a/55554709/9168586 (see the "Filter on MANY Columns" section and "to retain rows where at least one column is True"). As in that example:

dataframe[['teamA', 'teamB']].isin(['b']).any(axis=1)

0     True
1     True
2     True
3     True
dtype: bool

where 'b' would be one of the values (teams) through which I would iterate. After every iteration, if the whole column is True, I would remove that value from columns teamA and teamB in every row and continue to the next group (a sketch of this idea follows the error messages below).

The errors that I get are:

Cannot access callable attribute 'isin' of 'DataFrameGroupBy' objects, try using the 'apply' method

and

only list-like or dict-like objects are allowed to be passed to DataFrame.isin(), you passed a 'str'
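
For what it's worth, both errors go away once the value is wrapped in a list and isin is called on each group's sub-frame instead of on the DataFrameGroupBy object. A minimal sketch of the plan above, reusing the df from the earlier sketch (whether the all-True check is the right criterion is exactly what I am unsure about):

for name, group in df.groupby("name", sort=False):
    for value in pd.unique(group[["teamA", "teamB"]].values.ravel()):
        # isin needs a list-like, hence [value] rather than the bare string
        mask = group[["teamA", "teamB"]].isin([value]).any(axis=1)
        if mask.all():
            print(name, value)
# foo b
# bar a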
– nishikori

5 Answers


We can do melt, then drop the duplicates, and pivot it back:

s = (df.reset_index()
       .melt(['index', 'name'])
       .drop_duplicates(['name', 'value'], keep=False)
       .pivot_table(index=['index', 'name'], columns='variable',
                    values='value', aggfunc='first')
       .fillna('')
       .reset_index(level=1))
s['team'] = list(zip(s.teamA, s.teamB))
s
Out[102]: 
variable name teamA teamB   team
index                           
0         foo     a        (a, )
1         foo           c  (, c)
2         foo           d  (, d)
3         bar           e  (, e)
4         bar           d  (, d)
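
A caveat worth noting (my observation, not part of the original answer): keep=False drops every value that repeats within a name, not only values shared between the two columns. With the question's exact sample, where c appears twice for foo, c would be cleared as well; the output above suggests this was run on a slightly different sample, with row 2 as foo b d. A quick check against the question's frame:

m = df.reset_index().melt(['index', 'name'])
# rows whose (name, value) pair occurs more than once are what keep=False drops
print(m[m.duplicated(['name', 'value'], keep=False)][['name', 'value']].drop_duplicates())
#   name value
# 1  foo     b
# 2  foo     c    <- also flagged, unlike the question's desired output
# 3  bar     a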
– BENY
  • So, my question is obviously not complete. What if I have to group by two columns? And what if there are more than two duplicates, as in the example below? `name teamA teamB foo a b foo b c foo b c bar a e bar a d ` I want to remove just the b values, because they appear in every row. I could do that by removing the most frequent value or something like that, but my main problem is grouping by two columns. – nishikori Oct 30 '19 at 21:49

Try groupby and apply with stack, drop_duplicates, unstack, and fillna:

(df[['teamA', 'teamB']].groupby(df.name, sort=False)
                       .apply(lambda x: x.stack().drop_duplicates(keep=False))
                       .unstack().fillna('').reset_index('name'))

Out[93]:
  name teamA teamB
0  foo     a
1  foo           c
2  foo           d
3  bar           e
4  bar           d
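
The mechanics here are a melt-style reshape done per group: stack turns a group's two columns into one long Series, drop_duplicates(keep=False) removes every value that occurs more than once in that group, and unstack restores the teamA/teamB layout. A minimal look at the intermediate, assuming the foo rows this answer's output implies:

import pandas as pd

g = pd.DataFrame({'teamA': ['a', 'b', 'b'], 'teamB': ['b', 'c', 'd']})
stacked = g.stack()                         # one value per (row, column) pair
print(stacked.drop_duplicates(keep=False))  # 'b' occurs three times -> all dropped
# 0  teamA    a
# 1  teamB    c
# 2  teamB    d
# dtype: object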
– Andy L.
  • So, my question is obviously not complete. What if I have to group by two columns? And what if there are more than two duplicates, as in the example below (I will also edit my main question so you can see it more clearly)? `name teamA teamB foo a b foo b c foo b c bar a e bar a d` I want to remove just the b values, because they appear in every row. I could do that by removing the most frequent value or something like that, but my main problem is grouping by two columns. – nishikori Oct 30 '19 at 22:00

Maybe it is not as nice as @WeNYoBen's solution, but you could consider using a custom function, which is pretty flexible:

import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["foo"] * 3 + ["bar"] * 2,
                   "teamA": ["a", "b", "b", "a", "a"],
                   "teamB": ["b", "c", "d", "e", "d"]})


def fun(x):
    # values that appear in both columns within this group
    toRemove = list(set(x["teamA"].values).intersection(x["teamB"]))
    for col in ["teamA", "teamB"]:
        x[col] = np.where(x[col].isin(toRemove), " ", x[col])
    return x


df.groupby("name").apply(fun)

whose output is:

  name teamA teamB
0  foo     a      
1  foo           c
2  foo           d
3  bar     a     e
4  bar     a     d
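Note the bar rows in this output: fun removes only values that occur in both columns within a group, so bar's a survives here (it never appears in teamB in this sample), which is not quite the question's desired output. Since the follow-up comments ask about grouping by two columns, the same fun also works unchanged with a list of keys, assuming the frame has a year column as in the question's later edit:

# one sub-frame per (name, year) pair is passed to fun
df.groupby(["name", "year"]).apply(fun)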

– rpanai

Use groupby.apply + Series.isin.

Sample DataFrame:

print(df)

  name teamA teamB
0  foo     a     b
1  foo     b     c
2  foo     b     d
3  bar     a     e
4  bar     a     d
5  bar     b     a

new_df=df.copy()
groups=df.groupby('name',sort=False)
new_df['teamA']=groups.apply(lambda x: x['teamA'].mask(x['teamA'].isin(x['teamB']),' ')).reset_index(drop=True)
new_df['teamB']=groups.apply(lambda x: x['teamB'].mask(x['teamB'].isin(x['teamA']),' ')).reset_index(drop=True)
print(new_df)

  name teamA teamB
0  foo     a      
1  foo           c
2  foo           d
3  bar           e
4  bar           d
5  bar     b   

Then use DataFrame.apply with join and split to get the teams column:

new_df['teams']=new_df[['teamA','teamB']].apply(lambda x: ','.join(x).split(','),axis=1)
print(new_df)

  name teamA teamB   teams
0  foo     a        [a,  ]
1  foo           c  [ , c]
2  foo           d  [ , d]
3  bar           e  [ , e]
4  bar           d  [ , d]
5  bar     b        [b,  ]
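
As a small stylistic alternative to the join/split round-trip (my suggestion, not part of the original answer), values.tolist() yields the same list column directly:

new_df['teams'] = new_df[['teamA', 'teamB']].values.tolist()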
– ansev

After I edited my question yesterday... this is my data frame (df):

name  teamA   teamB year
foo    a        b    1
foo    b        c    1
foo    c        b    1
bar    a        e    2
bar    a        d    2
foo    a        h    2
foo    h        c    2
foo    h        b    2
...

This is the solution:

def fun(x):
    # melt the group so both team columns land in a single 'team' column
    melted = pd.melt(x.reset_index(), id_vars=['name', 'year'],
                     value_vars=['teamA', 'teamB'], var_name='var_name',
                     value_name='team')
    # the most frequent team is the one shared across the group's rows
    toRemove = melted.team.mode().iloc[0]
    for col in ["teamA", "teamB"]:
        x[col] = x[col].replace(toRemove, 'something')
    return x


df = df.groupby(["name", "year"]).apply(fun)

So, I melt my data frame and find the most frequent value, after which I remove that value from the two columns. Thanks @rpanai! Every answer was helpful, but yours the most!
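
One version-dependent note (an assumption about the pandas release in use, not something from this thread): groupby(...).apply with a function that returns the modified group may prepend the group keys to the result's index; passing group_keys=False keeps the original flat index:

# group_keys=False avoids a (name, year) MultiIndex on the result
df = df.groupby(["name", "year"], group_keys=False).apply(fun)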

– nishikori