pandas df remove items from df1 that are also in df2

Question

I have two very large CSV files, each with a single column of integers. For every integer in dfA, I need to check whether it is also in dfB. If so, I need to remove that item from dfA.

I could loop through dfA and check whether each value is in dfB, but looping is far too slow.

dfA :

        0
0  9312969810
1  3045897298
2  8162414592
3  2030000000
4  7876904982

dfB:

        0
0  2030000000
1  2030156119
2  2030389149
3  2030641047
4  2030693850

output:

        0
0  2030156119
1  2030389149
2  2030641047
3  2030693850

Since 2030000000 is in dfB, we need to remove it from dfA.

Does anyone have a better way? Thanks.

edit: the CSV for dfB is 2 GB and the one for dfA is 5 MB

Based on your input, you need `dfB[~dfB.isin(dfA).values]`. If this doesn't work, you'll need to update your input so we can reproduce your problem. — cs95, Sep 25 '17 at 18:36
Related and probable dupe: https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe — EdChum, Sep 26 '17 at 08:04

score 0 · Answer 1 · answered Sep 25 '17 at 18:37

There's no 'magic bullet' here; you'll have to loop through each list at least once.

You can iterate explicitly through just one of the lists as follows (though I think that, under the hood, both lists are still iterated):

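import pandas as pd

# header=None: the files have no header row, so the single column is labelled 0
dfA = pd.read_csv(file1, header=None)
dfB = pd.read_csv(file2, header=None)

# Drop from dfA every row whose value equals n, one dfB value at a time
for n in dfB[0]:
    dfA = dfA[dfA[0] != n]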

Alternatively, do what Zero suggested, though I think that is still looping under the hood (just more efficiently):

dfA[~dfA[0].isin(dfB[0])]
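
To make the `isin` approach concrete, here is a minimal sketch using the sample values from the question. It assumes the files have no header row (so the column is labelled 0) and uses in-memory `io.StringIO` buffers as stand-ins for the real files. Both directions are shown, because the question text asks to filter dfA while its sample output is dfB with the shared value removed. The helper `filter_in_chunks` and its `chunksize` default are hypothetical names introduced only for illustration; it reads the large file piece by piece via the `chunksize` option of `pd.read_csv`, which may help with a 2 GB file.

```python
import io

import pandas as pd

# Stand-ins for the two CSV files (assumption: one integer per line, no header).
csv_a = "9312969810\n3045897298\n8162414592\n2030000000\n7876904982\n"
csv_b = "2030000000\n2030156119\n2030389149\n2030641047\n2030693850\n"

dfA = pd.read_csv(io.StringIO(csv_a), header=None)
dfB = pd.read_csv(io.StringIO(csv_b), header=None)

# Rows of dfA whose value does not appear in dfB (what the question text asks for).
a_not_in_b = dfA[~dfA[0].isin(dfB[0])].reset_index(drop=True)

# Rows of dfB whose value does not appear in dfA (what the sample output shows).
b_not_in_a = dfB[~dfB[0].isin(dfA[0])].reset_index(drop=True)


def filter_in_chunks(source, values_to_drop, chunksize=1_000_000):
    """Hypothetical helper: read a large CSV in chunks and keep only the
    rows whose value is not in values_to_drop."""
    kept = [
        chunk[~chunk[0].isin(values_to_drop)]
        for chunk in pd.read_csv(source, header=None, chunksize=chunksize)
    ]
    return pd.concat(kept, ignore_index=True)


# Same result as b_not_in_a, but dfB is never loaded in one piece.
b_chunked = filter_in_chunks(io.StringIO(csv_b), dfA[0], chunksize=2)

print(a_not_in_b[0].tolist())
print(b_not_in_a[0].tolist())
print(b_chunked[0].tolist())
```

With real files, `source` would be the path to the larger CSV and `values_to_drop` the column read from the smaller one, so only the 5 MB file needs to be held fully in memory while filtering.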
