How to find the intersection of multiple pandas dataframes on a non index column

Question

I have found an answer that works for me to this Q and will post as well.

How to take a list of dataframes and create a new df that has only the rows that share a common value in a non index column?

Basically an intersection, but concat and merge wouldn't work for me for a number of reasons.

I looked at the following and didn't get what i needed:

Finding common rows (intersection) in two Pandas dataframes

Pandas merge df error

How to get intersection of dataframes based on column labels?

Intersection of multiple pandas dataframes

How to find intersection of dataframes based on multiple columns?

Intersection of pandas dataframe with multiple columns

How to do intersection of dataframes in pandas

This would do the trick: https://stackoverflow.com/a/26921975/11610186 — Grzegorz Skibinski, Dec 14 '19 at 19:49
merge can be made to work but only does 2 dfs, so for list of dfs a loop is still needed. the loop below avoids generating pairwise combinatorics and iteratively reducing. — Ryan Hunt, Dec 14 '19 at 19:55
Ah, this one I would keep, as per your answer. I would just change the intersection of 2 - the link I shared has better option for this bit, than what you proposed (i.e. the ```loc``` thing) — Grzegorz Skibinski, Dec 14 '19 at 19:58
the problem with merge is that is mangles the index values and the column headings. I will list the problem with merge in the answer. — Ryan Hunt, Dec 14 '19 at 20:35
@Grzegorz Skibinski If you agree with the Q and/or the A can you please upvote one or both? — Ryan Hunt, Dec 14 '19 at 22:12
No, sorry- I don't. Still doing ```pd.merge(df1, df2[col], on=col,...)``` would be better. In your solution you literally return whole ```Series``` and check every single row of other dataframe against every row of this returned one (it's ```O(n^2)```). There's no way really you can't make any form of ```join``` work for you instead. — Grzegorz Skibinski, Dec 15 '19 at 08:56
Also, from this page it seems that loc does not go row by row but works on the entire set at once. Why do you think otherwise?https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html — Ryan Hunt, Dec 16 '19 at 16:49

Ryan Hunt · Accepted Answer · 2019-12-14T20:39:56.553

This is what I ended up implementing. Would like to see if anyone has other ways of doing this more efficiently.

import copy

dfs_array = [ df1, df2, df3, df4, ... ]

def intersection_of_dfs(dfs_array,col='Ticker'):
    if len(dfs_array) <= 1:
        # if we only have 1 or 0 elements simply return the origial array
        # no error is given, logic must be created for the return value.
        return dfs_array
    # does a shallow copy only.
    dfs = copy.copy(dfs_array)
    length = len(dfs) 
    while length > 1:
        df1 = dfs.pop()
        df2 = dfs.pop()
        df0 = df1.loc[ df1[col].isin( df2[col].values ) ]
        dfs.insert(0,df0)
        length = len(dfs)
    return dfs

The suggestion to use merge wont work because it mangles the index and the column heads.

This is what merge gives:

>     [   open_x_x  high_x_x  low_x_x  close_x_x  volume_x_x Ticker  ...  LowAboveShort_y_y  ShortAboveLong_y_y  Return_y_y  DayDiff_y_y 
> AboveBelow_y_y  ShortToLong_y_y
>     0     52.60     52.68    52.24    52.4779        7632   AADR  ...            0.28214            1.087176    0.043298     2.600000             2.0         8.000000
>     1     14.03     14.03    14.03    14.0300         359   AAMC  ...            0.17472            0.628733    0.202228     1.333333             7.0         2.600000
>     2      2.15      2.15     1.72     1.9500       10095   AAME  ...           -0.20068            0.107564    0.114286     1.000000             1.0         0.636364
>     
>     [3 rows x 61 columns]]

And this is what the code below gives:

>     [             open   high    low    close  volume Ticker  Difference     LongMA   ShortMA  HighBelowShort  LowAboveShort 
> ShortAboveLong    Return   DayDiff  AboveBelow  ShortToLong
>     timestamp                                                                                                                                                                           
>     2019-12-12  52.60  52.68  52.24  52.4779    7632   AADR      0.1379  50.870684  51.95786         0.72214        0.28214        1.087176  0.043298  2.600000         2.0     8.000000
>     2019-12-12  14.03  14.03  14.03  14.0300     359   AAMC     -0.0100  13.226547  13.85528         0.17472        0.17472        0.628733  0.202228  1.333333         7.0     2.600000
>     2019-12-12   2.15   2.15   1.72   1.9500   10095   AAME      0.1900   1.813116   1.92068         0.22932       -0.20068        0.107564  0.114286  1.000000         1.0     0.636364]

Notice how the timestamp is maintained as are the column headings.

With no other ideas I am accepting my answer. Using merge alters the headings and to merge thousands of dataframes creates completely unrecognizable headers. Also, I couldn't figure out how to use merge to keep the column index intact without recreating the entire df. Using my method on a list of dfs of ~250kb takes only a few microseconds longer per df which is an acceptable level of compute cost for the solution. — Ryan Hunt, Dec 26 '19 at 19:28

How to find the intersection of multiple pandas dataframes on a non index column

1 Answers1