I have a collection of ~10 million gzipped CSV files, each with anywhere from 100 to 1,000 rows and more than 2,000 columns. Each file also contains a header row.
Each CSV file has two important columns, "ID" and "Target".
I'm trying to remove rows with duplicate "Target" values across the whole collection, while appending the ID of each removed row onto the one row that is kept.
E.g.
Input:
CSV1
| ID | Target |
|-------|------------------------------|
| IX213 | C1=CC(=CC=C1CC(=O)C(=O)O)O |
| IX412 | CN1C=NC2=C1C(=O)N(C(=O)N2C)C |
CSV2
| ID | Target |
|-------|------------------------------|
| BC144 | CN1C=NC2=C1C(=O)N(C(=O)N2C)C |
| BC155 | C(CC(=O)O)C(C(=O)O)N |
Output:
CSV1*
| ID | Target |
|--------------|------------------------------|
| IX213 | C1=CC(=CC=C1CC(=O)C(=O)O)O |
| IX412; BC144 | CN1C=NC2=C1C(=O)N(C(=O)N2C)C |
CSV2*
| ID | Target |
|-------|------------------------------|
| BC155 | C(CC(=O)O)C(C(=O)O)N |
For a small number of files this would be straightforward with Pandas (Python) or the like (see the sketch below), but I'm hoping someone has a much better way of doing it across millions of files with billions of rows in total.
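For reference, here is roughly the naive two-pass version I'd run on a small collection. This is only a minimal sketch: the glob pattern, the output file naming, and the `"; "` join convention are my assumptions, and it assumes the entire Target-to-ID map fits in memory.

```python
import glob
import pandas as pd

paths = sorted(glob.glob("data/*.csv.gz"))  # hypothetical file layout

first_seen = {}  # Target -> (path, ID) of the first row seen with that Target
extra_ids = {}   # Target -> [IDs of the duplicate rows that will be dropped]

# Pass 1: find the first occurrence of every Target and collect the IDs
# of all later duplicates. Pandas infers gzip compression from the .gz suffix.
for path in paths:
    df = pd.read_csv(path, usecols=["ID", "Target"])
    for row_id, target in zip(df["ID"], df["Target"]):
        if target in first_seen:
            extra_ids.setdefault(target, []).append(row_id)
        else:
            first_seen[target] = (path, row_id)

# Pass 2: rewrite each file, keeping only first occurrences and appending
# the IDs of the dropped duplicates onto the kept row.
for path in paths:
    df = pd.read_csv(path)
    mask = [first_seen[t] == (path, i) for i, t in zip(df["ID"], df["Target"])]
    kept = df[mask].copy()
    kept["ID"] = [
        "; ".join([str(i)] + [str(x) for x in extra_ids.get(t, [])])
        for i, t in zip(kept["ID"], kept["Target"])
    ]
    # Assumes every path ends in ".csv.gz"; writes alongside the original.
    kept.to_csv(path[: -len(".csv.gz")] + ".dedup.csv.gz", index=False)
```

The problem is obvious: the in-memory Target map alone is on the order of the total row count, and decompressing and parsing 10 million gzip files twice through Pandas is far too slow, which is why I'm looking for a better approach.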