
Say I have n dataframes, in this example n = 3.

**df1**
       A  B     C
0   True  3  21.0
1   True  1  23.0
2  False  2  25.0
3  False  4  25.5
4  False  1  25.0
5  True   0  26.0

**df2**
       A  B     C
0   True  3  21.0
1   True  1  23.0
2  False  2  25.0
3  False  4  25.5
4  True   2  19.0

**df3**
       A  B     C
0   True  3  21.0
1   True  2  23.0
2  False  2  25.0
3  False  1  25.0
4  False  4  25.5
5  True   0  27.50

**dfn** ...

I want one dataframe that includes all the rows whose Column C value appears in every dataframe df1…dfn. In other words: concatenate the dataframes, but keep only the rows whose Column C value lies in the intersection of all the C columns. So for the above dataframes, the rows with C = 19.0, 26.0 and 27.50 don't make it into the final dataframe, which is:

**Expected df**
0   True  3  21.0
1   True  1  23.0
2  False  2  25.0
3  False  4  25.5
4  False  1  25.0
0   True  3  21.0
1   True  1  23.0
2  False  2  25.0
3  False  4  25.5
0   True  3  21.0
1   True  2  23.0
2  False  2  25.0
3  False  1  25.0
4  False  4  25.5

So a row survives into the final dataframe if and only if its value in Column C is seen in all dataframes.

Reproducible code:

import pandas as pd
df1 = pd.DataFrame({'A': [True,True,False,False,False,True], 'B': [3,1,2,4,1,0],
                    'C': [21.0,23.0,25.0,25.5,25.0,26.0]})
df2 = pd.DataFrame({'A': [True,True,False,False,False], 'B': [3,1,2,4,2],
                    'C': [21.0,23.0,25.0,25.5,19.0]})
df3 = pd.DataFrame({'A': [True,True,False,False,False,True], 'B': [3,2,2,1,4,0],
                    'C': [21.0,23.0,25.0,25.0,25.5,27.5]})
dfn = ...
  • What's your expected results? I'm not sure if I understand the logic you described. – Allen Qin Feb 12 '18 at 23:27
  • See the edit I posted the expected result. – Ivan Feb 12 '18 at 23:30
  • I can't understand the question, and it disagrees with its own example. The value C = 23.0 is not duplicated in any of the dfs, yet it passes the filter. Finally at the bottom you state it properly: ***filter in a row if and only if, the value in Column C is seen in all dataframes*** Don't say "duplicated" – smci Feb 12 '18 at 23:36
  • Sorry I might have edited the data while you were looking at it. Now expected is correct. – Ivan Feb 12 '18 at 23:44
  • I posted a solution that does not require iteration, or an n x m intermediate matrix. – smci Feb 13 '18 at 00:37

3 Answers


The straightforward approach is to compute the common C values as an n-way set intersection, then filter with `.isin`:

common_C_values = set.intersection(set(df1['C']), set(df2['C']), set(df3['C']))
df_all = pd.concat([df1,df2,df3])
df_all = df_all[ df_all['C'].isin(common_C_values) ]
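The same approach generalizes to any number of dataframes if they are collected in a list (a runnable sketch using the question's sample data; `df_list` is an assumed name):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [True, True, False, False, False, True],
                    'B': [3, 1, 2, 4, 1, 0],
                    'C': [21.0, 23.0, 25.0, 25.5, 25.0, 26.0]})
df2 = pd.DataFrame({'A': [True, True, False, False, False],
                    'B': [3, 1, 2, 4, 2],
                    'C': [21.0, 23.0, 25.0, 25.5, 19.0]})
df3 = pd.DataFrame({'A': [True, True, False, False, False, True],
                    'B': [3, 2, 2, 1, 4, 0],
                    'C': [21.0, 23.0, 25.0, 25.0, 25.5, 27.5]})

df_list = [df1, df2, df3]  # works for any n

# n-way intersection of the C values across all frames
common_C_values = set.intersection(*(set(df['C']) for df in df_list))

# stack the frames, then keep only rows whose C is common to all
df_all = pd.concat(df_list)
df_all = df_all[df_all['C'].isin(common_C_values)]
```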

For simplicity, store your dataframes in a list. We'll make use of set operations to speed this up as much as possible.

df_list = [df1, df2, df3, ...]
common_idx = set.intersection(*[set(df['C']) for df in df_list]) 

print(common_idx)
{21.0, 23.0, 25.0, 25.5}

Thanks to @smci for the improvement! `set.intersection(*...)` takes the n-way intersection of the C values directly. Finally, call `pd.concat` to join the dataframes vertically, then use `query` to filter on the common values obtained in the previous step.

pd.concat(df_list, ignore_index=True).query('C in @common_idx')

        A  B     C
0    True  3  21.0
1    True  1  23.0
2   False  2  25.0
3   False  4  25.5
4   False  1  25.0
5    True  3  21.0
6    True  1  23.0
7   False  2  25.0
8   False  4  25.5
9    True  3  21.0
10   True  2  23.0
11  False  2  25.0
12  False  1  25.0
13  False  4  25.5
  • Ok but I need the result in the order of the original dataframe. They get stacked on ontop of the other based on the criteria. I guess I can sort that myself.... – Ivan Feb 12 '18 at 23:33
  • @Ivan I've posted a simpler solution that retains order. – cs95 Feb 12 '18 at 23:43
  • Thanks COLDSPEED. There is something very appealing about this solution. I think it is a meme to follow in other cases. – Ivan Feb 12 '18 at 23:57
  • @Ivan Might want to test both the answers. The other one looks slower to me. – cs95 Feb 12 '18 at 23:58
  • Iteratively doing `reduce(lambda x, y: x.intersection(y...` is not necessary ; [Python 2.6+ supports n-way `set.intersection(...)`](https://stackoverflow.com/questions/2541752/best-way-to-find-the-intersection-of-multiple-sets/9492293) – smci Feb 13 '18 at 00:34
  • @smci cheers. I tried that first, but forgot to unpack the list, so I didn't get it to work. Fixed now. – cs95 Feb 13 '18 at 00:55
  • Sure. You might acknowledge my suggestion in your answer for the new improved v2.0 ... ;-) – smci Feb 13 '18 at 01:09
  • @smci fair enough, also passed you an upvote. Cheers – cs95 Feb 13 '18 at 05:00

You can use pd.concat:

# merge column C from all DataFrames
df_C = pd.concat([df1, df2, df3], axis=1)['C']
# concat all DataFrames
df_all = pd.concat([df1,df2,df3])
# only extract rows with its C value appears in all DataFrames C columns.
df_all.loc[df_all.apply(lambda x: df_C.eq(x.C).sum().all(), axis=1)]
Out[105]: 
       A  B     C
0   True  3  21.0
1   True  1  23.0
2  False  2  25.0
3  False  4  25.5
4  False  1  25.0
0   True  3  21.0
1   True  1  23.0
2  False  2  25.0
3  False  4  25.5
0   True  3  21.0
1   True  2  23.0
2  False  2  25.0
3  False  1  25.0
4  False  4  25.5
  • Works beautifully. Thanks. – Ivan Feb 12 '18 at 23:56
  • Creating the n x m matrix `df_C` is a waste of time and memory. Just directly compute the common C values (as a set/list), then filter using the test `df['C'].isin()` – smci Feb 13 '18 at 00:32
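A sketch of the comment's suggestion, which avoids both the row-wise `apply` and the wide intermediate `df_C` (the sample frames from the question are repeated here so the snippet runs on its own):

```python
import pandas as pd

# same sample data as in the question
df1 = pd.DataFrame({'A': [True, True, False, False, False, True],
                    'B': [3, 1, 2, 4, 1, 0],
                    'C': [21.0, 23.0, 25.0, 25.5, 25.0, 26.0]})
df2 = pd.DataFrame({'A': [True, True, False, False, False],
                    'B': [3, 1, 2, 4, 2],
                    'C': [21.0, 23.0, 25.0, 25.5, 19.0]})
df3 = pd.DataFrame({'A': [True, True, False, False, False, True],
                    'B': [3, 2, 2, 1, 4, 0],
                    'C': [21.0, 23.0, 25.0, 25.0, 25.5, 27.5]})

# common C values computed directly as a set intersection
common_C = set(df1['C']) & set(df2['C']) & set(df3['C'])

# vectorized membership test instead of a per-row apply
df_all = pd.concat([df1, df2, df3])
result = df_all[df_all['C'].isin(common_C)]
```

Because `ignore_index` is not used, each row keeps its index from its original dataframe, matching the output shown in this answer.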