I need to remove duplicates from a large .csv file (50+GB), and I would like to do this in Python. Several other questions address the issue broadly (e.g. here and here), but they deal with exact duplicates.
In my case, the duplicates are not exact duplicates. I compiled this file by pulling rows from several sources, and one column indicates the source of origin, so two rows pulled from different sources can be identical except for that column. This means I want to remove duplicates based on a subset of the columns. The size of the file means I cannot load it into memory, so pandas is out.
How can I approach this problem (possibly modifying the solutions I linked to)?
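For context, here is a minimal sketch of the direction I was considering: stream the file row by row, hash only the key columns, and keep a set of digests so that memory grows with the number of unique keys rather than the file size. The file paths and the key column indices below are placeholders for my actual data. I'm not sure this is the right approach if the number of unique keys is itself very large.

```python
import csv
import hashlib

# Hypothetical names: adjust the paths and key column indices to the real file.
INPUT_PATH = "combined.csv"      # the 50+GB compiled file
OUTPUT_PATH = "deduplicated.csv"
KEY_COLUMNS = [0, 1, 2]          # columns that define a duplicate
                                 # (i.e., everything except the source column)

def dedupe_streaming(input_path, output_path, key_columns):
    """Stream the CSV and keep only the first row seen for each key.

    Only fixed-size hash digests are held in memory, not the rows
    themselves, so memory use scales with the number of unique keys.
    """
    seen = set()
    with open(input_path, newline="") as src, \
         open(output_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        for row in reader:
            # Build the dedup key from the chosen columns only.
            key = "\x1f".join(row[i] for i in key_columns)
            digest = hashlib.md5(key.encode("utf-8")).digest()
            if digest not in seen:
                seen.add(digest)
                writer.writerow(row)

if __name__ == "__main__":
    dedupe_streaming(INPUT_PATH, OUTPUT_PATH, KEY_COLUMNS)
```

Is something along these lines reasonable, or should I be looking at a different technique entirely (e.g. an external sort on the key columns)?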