
I have two incredibly large dataframes, df1 and df2. Their sizes are below:

print(df1.shape) #444500 x 3062
print(df2.shape) #254232 x 3062

I know that every row of df2 appears in df1, and I want to build a third dataframe that is the difference of the two: all of the rows that appear in df1 but do not appear in df2.

I have tried using the below method from this question:

df3 = (pd.merge(df2,df1, indicator=True, how='outer')
            .query('_merge=="left_only"').drop('_merge', axis=1))

But I continually get `MemoryError` failures with this approach.

Thus, I am now trying to do the following:

  1. Loop through each row of df1
  2. See if that row appears in df2
  3. If it does, skip
  4. If not, add it to a list
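The plan above can be made memory-friendly by hashing each row of df2 once into a set of tuples and testing every df1 row against it. A minimal sketch with toy frames (`df3` and the sample data are illustrative):

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3], [1, 2, 3], [4, 5, 6], [7, 8, 9]])
df2 = pd.DataFrame([[4, 5, 6], [7, 8, 9]])

# Hash every df2 row once; tuples preserve column order,
# so (1, 2, 3) and (1, 3, 2) are different keys.
df2_rows = set(df2.itertuples(index=False, name=None))

# Keep only the df1 rows whose tuple is absent from df2.
mask = [row not in df2_rows
        for row in df1.itertuples(index=False, name=None)]
df3 = df1[mask]
```

This touches each row once and never materializes a merged frame, so the peak memory cost is roughly the set of df2's rows.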

What I mean by a match is that two rows are equal element-wise, in the same order; for example

[1,2,3]
[1,2,3]

is a match, while:

[1,2,3]
[1,3,2]

is not a match
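This order-sensitive notion of a match is exactly what `np.array_equal` tests:

```python
import numpy as np

# Element order matters: pairwise comparison, not set equality.
print(np.array_equal([1, 2, 3], [1, 2, 3]))  # True
print(np.array_equal([1, 2, 3], [1, 3, 2]))  # False
```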

I am now trying:

for i in notebook.tqdm(range(svm_data.shape[0])):
    real_row = np.asarray(real_data.iloc[[i]].to_numpy())
    synthetic_row = np.asarray(svm_data.iloc[[i]].to_numpy())
    if (np.array_equal(real_row, synthetic_row)):
        continue
    else:
        list_of_rows.append(list(synthetic_row))
    gc.collect()

But for some reason this is not finding the matching rows, so I am clearly still doing something wrong.

Note, I also tried: df3 = df1[~df1.isin(df2)].dropna(how='all')

but that yielded incorrect results.

How can I (in a memory-efficient way) find all of the rows in one of my dataframes that do not appear in the other?

DATA

df1:

1,0,0.0,0,0,0,0,0,0.0,2
1,0,0.0,0,0,0,0,0,0.0,2
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,0.0,2
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,0.0,2

df2:

1,0,0.0,0,0,0,0,0,0.0,2
1,0,0.0,0,0,0,0,0,0.0,3
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,2.0,2
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,1,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,5,0,0,0,0,0.0,4
artemis
  • do you need to compare all the columns to conclude items are same/different, or can you get the same results when comparing, say, 5 columns? – warped Jun 30 '20 at 19:18
  • @warped unfortunately, all columns need to be compared. I know it is a very big feature space. The 3-element example I posted was more for clarity's sake. – artemis Jun 30 '20 at 19:19
  • Does this answer your question? [set difference for pandas](https://stackoverflow.com/questions/18180763/set-difference-for-pandas). Specifically `pd.concat([df2, df1]).drop_duplicates(keep=False)`. – Trenton McKinney Jun 30 '20 at 19:38
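One caveat with the suggested `pd.concat([df2, df1]).drop_duplicates(keep=False)` one-liner: `keep=False` drops every row that occurs more than once anywhere in the concatenation, so it only yields the set difference when neither frame contains internal duplicates. The sample data above repeats rows within df1, where it fails. A small sketch of that failure mode:

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3], [1, 2, 3], [4, 5, 6]])
df2 = pd.DataFrame([[4, 5, 6]])

# keep=False discards *every* duplicated row, so the two identical
# [1, 2, 3] rows (which never appear in df2) eliminate each other.
df3 = pd.concat([df2, df1]).drop_duplicates(keep=False)
print(df3)  # empty frame
```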

1 Answer


Let's try concat and groupby to identify duplicate rows:

# sample data
df1 = pd.DataFrame([[1,2,3],[1,2,3],[4,5,6],[7,8,9]])
df2 = pd.DataFrame([[4,5,6],[7,8,9]])

s = (pd.concat((df1,df2), keys=(1,2))
       .groupby(list(df1.columns))
       .ngroup()
    )

# `s.loc[1]` corresponds to rows in df1
# `s.loc[2]` corresponds to rows in df2
df1_in_df2 = s.loc[1].isin(s.loc[2])

df1[df1_in_df2]

Output:

   0  1  2
2  4  5  6
3  7  8  9
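Since the question asks for the rows of df1 that are *absent* from df2, the same mask can simply be negated; a sketch using the answer's sample frames:

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3], [1, 2, 3], [4, 5, 6], [7, 8, 9]])
df2 = pd.DataFrame([[4, 5, 6], [7, 8, 9]])

# Assign every distinct row a group id; identical rows share one id.
s = (pd.concat((df1, df2), keys=(1, 2))
       .groupby(list(df1.columns))
       .ngroup())

df1_in_df2 = s.loc[1].isin(s.loc[2])

# Negate the mask to keep the df1 rows missing from df2.
df3 = df1[~df1_in_df2]
```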

Update: Another option is to merge against the de-duplicated df2:

df1.merge(df2.drop_duplicates(), on=list(df1.columns), indicator=True, how='left')

Output (you should be able to guess which rows you need from there):

   0  1  2     _merge
0  1  2  3  left_only
1  1  2  3  left_only
2  4  5  6       both
3  7  8  9       both
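Filtering that merge result down to the difference is then one more step; a sketch with the same sample frames:

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3], [1, 2, 3], [4, 5, 6], [7, 8, 9]])
df2 = pd.DataFrame([[4, 5, 6], [7, 8, 9]])

merged = df1.merge(df2.drop_duplicates(), on=list(df1.columns),
                   indicator=True, how='left')

# 'left_only' marks rows present in df1 but absent from df2.
df3 = (merged.query('_merge == "left_only"')
             .drop(columns='_merge'))
```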
Quang Hoang
  • In this case, what does `keys` do, and even though my actual size is `3062`, must I change those values, or the values in `s.loc[2].isin(s.loc[1])`? – artemis Jun 30 '20 at 19:24
  • No, you don't need to update the keys. `keys` indicates which dataframe (`df2`, `df1`) each row comes from. So `s.loc[2]` refers to the second one, which is `df1` in this case. – Quang Hoang Jun 30 '20 at 19:25
  • @wundermahn see updated answer for a more natural order of `keys`. – Quang Hoang Jun 30 '20 at 19:27
  • `df3 = df1.merge(df2.drop_duplicates(), on=list(df1.columns), indicator=True, how='left')`? – artemis Jun 30 '20 at 19:35