
I have two large datasets:

  • training set size: 289,816 rows × 689 columns

  • testing set size: 49,863 rows × 689 columns

I want to drop rows from the testing set that already exist in the training set.

I checked the following answer https://stackoverflow.com/a/44706892

but unfortunately the Python process gets killed once 144 GB of memory are filled.

Is there a better solution that is less resource-intensive?
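For reference, this is the kind of whole-DataFrame operation I am after; the merge pattern below is only an illustrative sketch (with hypothetical file names), not necessarily the exact code from the linked answer, and it is what exhausts memory at this scale:

import pandas as pd

# Hypothetical paths; both files share the same 689 columns
train = pd.read_csv('train.csv', delimiter=';', dtype=str)
test = pd.read_csv('test.csv', delimiter=';', dtype=str)

# Keep only the test rows that have no exact full-row match in training
merged = test.merge(train.drop_duplicates(), how='left', indicator=True)
test_filtered = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')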


1 Answer


I would recommend a piecewise solution, something similar to the following pseudo-code:

import pandas as pd

chunksize = 10**5
first_write = True

for test_chunk in pd.read_csv(test_set_path, delimiter=';', dtype=str, chunksize=chunksize):

    # Drop every test row that also appears (as an exact full-row match)
    # in the training set, one training chunk at a time
    for train_chunk in pd.read_csv(train_set_path, delimiter=';', dtype=str, chunksize=chunksize):
        merged = test_chunk.merge(train_chunk.drop_duplicates(), how='left', indicator=True)
        test_chunk = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

    # Append the surviving rows to the result file, writing the header only once
    test_chunk.to_csv('path/to/your/result/csv', mode='a', index=False, header=first_write)
    first_write = False

This way you don't run out of memory.
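If re-reading the whole training file for every test chunk turns out to be too slow, a possible variation (just a sketch, assuming exact full-row matches define a duplicate and using pandas' pd.util.hash_pandas_object; 64-bit row hashes carry a tiny collision risk) is to scan the training set once, keep only a set of row hashes, and then filter each test chunk against that set:

import pandas as pd

chunksize = 10**5

# Single pass over the training file: collect one 64-bit hash per row
train_hashes = set()
for train_chunk in pd.read_csv(train_set_path, delimiter=';', dtype=str, chunksize=chunksize):
    train_hashes.update(pd.util.hash_pandas_object(train_chunk, index=False))

# Single pass over the test file: keep only rows whose hash was never seen in training
first_write = True
for test_chunk in pd.read_csv(test_set_path, delimiter=';', dtype=str, chunksize=chunksize):
    mask = ~pd.util.hash_pandas_object(test_chunk, index=False).isin(train_hashes)
    test_chunk[mask].to_csv('path/to/your/result/csv', mode='a', index=False, header=first_write)
    first_write = False

The memory footprint is then only the hash set (a few megabytes for ~290k rows) plus one chunk at a time.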