
I have two very large CSV files. Each has a single column of integers. For every integer in dfA, I need to check whether it is in dfB. If so, I need to remove that item from dfA.

I would probably loop through dfA and check whether each value is in dfB, but looping is way too slow.

dfA :

        0
0  9312969810
1  3045897298
2  8162414592
3  2030000000
4  7876904982

dfB:

        0
0  2030000000
1  2030156119
2  2030389149
3  2030641047
4  2030693850

output:

        0
0  2030156119
1  2030389149
2  2030641047
3  2030693850

Since 2030000000 is in dfB, we need to remove it from dfA.

Does anyone have a better way? Thanks.

Edit: the CSV for dfB is 2 GB and the one for dfA is 5 MB.

VincFort
  • Try `dfB[~dfB['colname'].isin(dfA['colname'])]` – Zero Sep 25 '17 at 18:27
  • Can you please show df1, df2, and some sample output? – cs95 Sep 25 '17 at 18:27
  • Based on your input, you need `dfB[~dfB.isin(dfA).values]`. If this doesn't work, you'll need to update your input so we can reproduce your problem. – cs95 Sep 25 '17 at 18:36
  • Related and probable dupe: https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe – EdChum Sep 26 '17 at 08:04

1 Answer


There's no magic bullet here: you'll have to loop through each list at least once.

You can write the loop over just one of the lists yourself, as follows (though I think that, under the hood, both lists still get iterated):

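import pandas as pd

# file1 and file2 are the CSV paths; header=None labels the single column 0
dfA = pd.read_csv(file1, header=None)
dfB = pd.read_csv(file2, header=None)

# Drop from dfA every row whose value equals the current dfB value
for n in dfB[0]:
    dfA = dfA[dfA[0] != n]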

Alternatively, do what Zero suggested. I think that still loops under the hood, just more efficiently:

dfA = dfA[~dfA[0].isin(dfB[0])]
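
Since the question mentions that dfB is about 2 GB, here is a rough sketch of how the `isin` filter could be combined with chunked reading so the large file never has to be loaded in one go. It assumes headerless single-column files, as in the sample data, and removes from dfA the values found in dfB. The function name `drop_values_present_in_large_csv` and the `chunksize` default are made up for illustration, and `io.StringIO` stands in for the real files.

```python
import io

import pandas as pd


def drop_values_present_in_large_csv(df_small, large_csv, chunksize=1_000_000):
    """Return df_small without the rows whose value appears in large_csv.

    large_csv is a path or file-like object with one headerless integer column.
    (Hypothetical helper, not part of pandas.)
    """
    # One flag per row of df_small: has this value been seen in the large file?
    found = pd.Series(False, index=df_small.index)
    # Read the large file piece by piece so it never sits in memory all at once.
    for chunk in pd.read_csv(large_csv, header=None, chunksize=chunksize):
        found |= df_small[0].isin(chunk[0])
    return df_small[~found].reset_index(drop=True)


# Sample data from the question, wrapped in StringIO to stand in for the files.
csv_a = io.StringIO("9312969810\n3045897298\n8162414592\n2030000000\n7876904982\n")
csv_b = io.StringIO("2030000000\n2030156119\n2030389149\n2030641047\n2030693850\n")

dfA = pd.read_csv(csv_a, header=None)
# A tiny chunksize here only to exercise the chunk loop on five rows.
result = drop_values_present_in_large_csv(dfA, csv_b, chunksize=2)
print(result)
```

If the goal is the other direction, as in the comments (keep the dfB rows that are not in dfA), the same idea should work by filtering each chunk with `chunk[~chunk[0].isin(dfA[0])]` and writing or collecting the filtered chunks.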
Mohammad Athar