I have two incredibly large dataframes
, df1
and df2
. Their sizes are below:
print(df1.shape) #444500 x 3062
print(df2.shape) #254232 x 3062
I know that each value of df2
appears in df1
, and what I am looking to do is build a third dataframe that is the difference of the two, meaning, all of the rows that appear in df1
that do not appear in df2
.
I have tried using the below method from this question:
df3 = (pd.merge(df2,df1, indicator=True, how='outer')
.query('_merge=="left_only"').drop('_merge', axis=1))
But am continually getting MemoryError
failures due to this
Thus, I am now trying to do the following:
- Loop through each row of df1
- See if df1 appears in df2
- If it does, skip
- If not, add it to a list
What I am concerned about, in terms of rows, is that the rows of data are equal, meaning, all of the elements match pairwise, for example
[1,2,3]
[1,2,3]
is a match, while:
[1,2,3]
[1,3,2]
is not a match
I am now trying:
for i in notebook.tqdm(range(svm_data.shape[0])):
real_row = np.asarray(real_data.iloc[[i]].to_numpy())
synthetic_row = np.asarray(svm_data.iloc[[i]].to_numpy())
if (np.array_equal(real_row, synthetic_row)):
continue
else:
list_of_rows.append(list(synthetic_row))
gc.collect()
But for some reason, this is not finding the values in the rows themselves, so I am clearly still doing something wrong.
Note, I also tried:
df3 = df1[~df1.isin(df2)].dropna(how='all')
but that yielded incorrect results.
How can I (in a memory efficient way) find all of the rows in one of my dataframe
DATA
df1:
1,0,0.0,0,0,0,0,0,0.0,2
1,0,0.0,0,0,0,0,0,0.0,2
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,0.0,2
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,0.0,2
df2:
1,0,0.0,0,0,0,0,0,0.0,2
1,0,0.0,0,0,0,0,0,0.0,3
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,2.0,2
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,1,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,5,0,0,0,0,0.0,4