
I have two large datasets:

  • training set size: 289,816 rows × 689 columns

  • testing set size: 49,863 rows × 689 columns

I want to drop rows from the testing set that already exist in the training set.

I checked the following answer https://stackoverflow.com/a/44706892

but unfortunately the Python process gets killed once 144 GB of memory are filled.

Is there a better solution that is less resource-intensive?
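For reference, this is the kind of whole-DataFrame operation I am after; the merge pattern below is only an illustrative sketch (with hypothetical file names), not necessarily the exact code from the linked answer, and it is what exhausts memory at this scale:

import pandas as pd

# Hypothetical paths; both files share the same 689 columns
train = pd.read_csv('train.csv', delimiter=';', dtype=str)
test = pd.read_csv('test.csv', delimiter=';', dtype=str)

# Keep only the test rows that have no exact full-row match in training
merged = test.merge(train.drop_duplicates(), how='left', indicator=True)
test_filtered = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')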


1 Answer


I would recommend a piecewise solution, something similar to the following pseudo-code:

import pandas as pd

chunksize = 10**5
first_write = True

for test_chunk in pd.read_csv(test_set_path, delimiter=';', dtype=str, chunksize=chunksize):

    # Drop every test row that also appears (as an exact full-row match)
    # in the training set, one training chunk at a time
    for train_chunk in pd.read_csv(train_set_path, delimiter=';', dtype=str, chunksize=chunksize):
        merged = test_chunk.merge(train_chunk.drop_duplicates(), how='left', indicator=True)
        test_chunk = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

    # Append the surviving rows to the result file, writing the header only once
    test_chunk.to_csv('path/to/your/result/csv', mode='a', index=False, header=first_write)
    first_write = False

This way you don't run out of memory.
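If re-reading the whole training file for every test chunk turns out to be too slow, a possible variation (just a sketch, assuming exact full-row matches define a duplicate and using pandas' pd.util.hash_pandas_object; 64-bit row hashes carry a tiny collision risk) is to scan the training set once, keep only a set of row hashes, and then filter each test chunk against that set:

import pandas as pd

chunksize = 10**5

# Single pass over the training file: collect one 64-bit hash per row
train_hashes = set()
for train_chunk in pd.read_csv(train_set_path, delimiter=';', dtype=str, chunksize=chunksize):
    train_hashes.update(pd.util.hash_pandas_object(train_chunk, index=False))

# Single pass over the test file: keep only rows whose hash was never seen in training
first_write = True
for test_chunk in pd.read_csv(test_set_path, delimiter=';', dtype=str, chunksize=chunksize):
    mask = ~pd.util.hash_pandas_object(test_chunk, index=False).isin(train_hashes)
    test_chunk[mask].to_csv('path/to/your/result/csv', mode='a', index=False, header=first_write)
    first_write = False

The memory footprint is then only the hash set (a few megabytes for ~290k rows) plus one chunk at a time.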