
I would like to delete duplicates from one large CSV file. The data has this format:

client_id;gender;age;profese;addr_cntry;NAZOKRESU;prijem_AVG_6M_pasmo;cont_id;main_prod_id;bal_actl_am_pasmo
388713248;F;80;důchodce;CZ;Czech;;5715125;39775;
27953927;M;28;Dělník;CZ;Opavia;22;4427292;39075;

I need to delete all rows with a duplicate client_id. I cannot handle this big file in Python with Pandas. I tried Dask, but got the same result: an infinite wait and nothing really happening.

Here is my latest version of the code:


import dask.dataframe as dd
import chardet
from dask.diagnostics import ProgressBar

with open('bigData.csv', 'rb') as f:
    result = chardet.detect(f.read())

df = dd.read_csv('bigData.csv', encoding=result['encoding'], sep=';')

total_rows = df.shape[0].compute()

df = df.drop_duplicates(subset=['client_id'], keep=False)

df.to_csv('bigData.csv', sep=';', index=False)

total_duplicates = total_rows - df.shape[0].compute()

print(f'Was deleted {total_duplicates} duplicated rows.')

I tried it with a progress bar and nothing really happened. Thanks for any help!

David Kopl
  • I have solved duplicate removal with AWK a few times; it also works well with very large files. A Python way to solve the problem is [described here](https://stackoverflow.com/questions/3452832/remove-duplicate-rows-from-a-large-file-in-python). – MyICQ Feb 27 '23 at 11:29
  • Thanks for the comment, but AWK is out of my scope. I am a Windows user, but thanks for trying! – David Kopl Feb 27 '23 at 11:49
  • I am a Windows user too. AWK is [available for Windows](https://sourceforge.net/projects/gnu-awk-for-windows/) and works 99% like on Linux / Mac. Have you tried the linked suggestion in the other question, using SQLite and a key index (see the sketch below)? – MyICQ Feb 27 '23 at 12:21
  • The solution I linked to using SQLite is **significantly** slower, so not ideal for larger datasets. – MyICQ Mar 01 '23 at 08:02
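
For reference, a minimal sketch of the SQLite-with-key-index idea mentioned in these comments might look like the following. It assumes the question's ;-separated file with client_id in the first column; the file names and the seen table are placeholders of mine, not anything from the linked answer:

import csv
import sqlite3

con = sqlite3.connect("dedupe.db")  # on-disk database, so the IDs don't have to fit in RAM
con.execute("CREATE TABLE IF NOT EXISTS seen (client_id TEXT PRIMARY KEY)")

with open("input.csv", newline="") as fin, open("output.csv", "w", newline="") as fout:
    reader = csv.reader(fin, delimiter=";")
    writer = csv.writer(fout, delimiter=";")
    writer.writerow(next(reader))  # copy the header row

    for row in reader:
        # INSERT OR IGNORE only inserts the first time a client_id appears,
        # so rowcount tells us whether this row should be kept.
        if con.execute("INSERT OR IGNORE INTO seen VALUES (?)", (row[0],)).rowcount == 1:
            writer.writerow(row)

con.commit()
con.close()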

2 Answers


You might be able to get away with a very simple Python program that stores every new ID it sees in a dict and skips writing a subsequent row if it finds that row's ID already in the dict. It should require about 2GB of RAM.

import csv

reader = csv.reader(open("input.csv", newline=""))
writer = csv.writer(open("output.csv", "w", newline=""))

writer.writerow(next(reader))  # transfer header, if you have one

ids = {}
for row in reader:
    if row[0] not in ids:
        writer.writerow(row)
        ids[row[0]] = None  # add ID to "list" of already written IDs

This approach:

  • Uses a dict, ids, to hold all IDs the program has already encountered and written; dicts can do really fast lookups/checks on their keys (your IDs).
  • Keeps the original ordering of the rows.
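
As a side note, the same approach adapted to the question's file would only differ in the delimiter, and a set works just as well as a dict here. A minimal sketch (the file names are placeholders):

import csv

seen = set()
with open("input.csv", newline="") as fin, open("output.csv", "w", newline="") as fout:
    reader = csv.reader(fin, delimiter=";")   # the question's file is ;-separated
    writer = csv.writer(fout, delimiter=";")
    writer.writerow(next(reader))             # copy the header row

    for row in reader:
        if row[0] not in seen:                # first time we've seen this client_id
            writer.writerow(row)
            seen.add(row[0])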

I mocked up a CSV with 20M rows (and randomly generated IDs between 0 and 20M) that looks something like this:

| id       | i |
|----------|---|
| 2266768  | 0 |
| 15245359 | 1 |
| 16304974 | 2 |
| 4801643  | 3 |
| 9612409  | 4 |
| 17659151 | 5 |
| 15824934 | 6 |
| 4101873  | 7 |
| 12282127 | 8 |
| 5172219  | 9 |

I ran it through the dict-based program above and ended up with 12.6M rows. On my MacBook Air M1 (dual-channel SSD) that took 14 seconds and consumed 1.5GB of RAM. The RAM is needed to hold all the previously seen IDs.

Also, I see you're reading the entire file first just to detect the character encoding:

  • Have you tried running chardetect from the command line, chardetect input.csv, and just hard-coding the value returned?

  • Have you experimented with reading a much smaller portion of the file and seeing what results and confidence you get?

    with open("input.csv", "rb") as f:
        input_enc = chardet.detect(f.read(1024 * 64))  # only read first 64K
    
    print(input_enc)  # {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
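
Once you trust the detected (or hard-coded) encoding, you can pass it straight to the text-mode open used for the actual CSV reading. A minimal sketch, assuming the 64K-prefix detection above and the question's ;-separated file:

import csv
import chardet

with open("input.csv", "rb") as f:
    input_enc = chardet.detect(f.read(1024 * 64))["encoding"]  # detect on the first 64K only

with open("input.csv", newline="", encoding=input_enc) as f:
    print(next(csv.reader(f, delimiter=";")))  # e.g. sanity-check the header row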
    
Zach Young
  • Since you already have the source file: out of interest, could you do a performance test (RAM/time) with the SQLite (external file) solution, and possibly also with AWK? I can do it myself, but comparing from the same source would make sense. – MyICQ Feb 28 '23 at 08:38
  • Hi @MyICQ, I don't know AWK, so I couldn't do a fair comparison. You seem to favor AWK; will you take my answer, write your best AWK script, and compare? I'm personally curious how their performance compares in general, but I would never consider AWK for a general-purpose CSV parser. [Here's the small Python script](https://gist.github.com/zacharysyoung/da51b614acc7f8e97b249d1f47302900#file-gen-20m-random-ids-py) I used to generate the 20M random IDs. Thanks! – Zach Young Feb 28 '23 at 13:35
  • Not trying to be in absolute favor of AWK, just curious about which way (Python/dict, Python/SQLite, AWK, ...) is faster. I will update later. – MyICQ Feb 28 '23 at 13:50

The same task using AWK. This is not what the OP asked for, just to complete the comment thread above. Do not accept this as the answer.

BEGIN{
  FS=","   # set field separator to comma
}
!seen[$2]++ {    # is field 2 not seen before ? 
    print $0
}

Sample data:

RowNum,ID
1,5220607
2,8632078
3,8323076
..

Run as c:\>awk -f script.awk input.csv > uniquevalues.csv

This outputs about 12 million rows, and consumes 1.8GB of memory in about 18 seconds (i7, Windows).

The Python script from @zach-young above, on the same computer and file, took about 35 seconds, but used less memory.

MyICQ