
I have a large CSV of around 24 million rows, and I want to cut it down in size.

Here is a little preview of the CSV:

[screenshot: CSV preview]

I want to remove the rows that have the same CIK and IP. I have a bunch of these files and they take up a lot of space, so I want an efficient way to remove the duplicates.

I've made a test script to check how many duplicates of CIK there are, and for some there are more than 100k; that is why I want to cut those duplicates out.

I've tried a few approaches, but in most cases they failed because of the size of the CSV.

Stefan
  • What do you mean by *it failed*? Does the program crash or are you getting some kind of error? – Nazim Kerimbekov May 03 '19 at 21:13
  • How many duplicates do you have per `CIK`/`IP` combination? – PMende May 03 '19 at 21:14
  • The program doesn't work. I got errors, and after I fixed them the program just loops forever; that's why I posted, so I can get some suggestions on how to do it – Stefan May 03 '19 at 21:14
  • I've made a test script to count how many rows have the same CIK, and for some there are more than 100k; that is the reason to remove some of them – Stefan May 03 '19 at 21:15
  • Are you successfully able to read it into pandas and then the drop_duplicates operation breaks, or can't you successfully create a df? – Ben Pap May 03 '19 at 21:16
  • Can you post what you tried already, to see why it loops forever? – solarc May 03 '19 at 21:20
  • I didn't use pandas, but I will try it, because it seems to be the only suggested solution – Stefan May 03 '19 at 21:21

3 Answers


Another quick way is to do it with awk, running from the command line:

awk -F, '!x[$1,$5]++' file.csv > file_uniq.csv

where `file.csv` is the name of your file and `file_uniq.csv` is where you want your deduplicated records (`$1` and `$5` are column numbers: 1 for `ip` and 5 for `cik`).

P.S. You should already have awk if you're on Linux/Mac, but you may need to download it separately on Windows.
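
The expression `!x[$1,$5]++` prints a line only the first time its (column 1, column 5) pair is seen. For comparison, here is a minimal streaming sketch of the same idea in Python, assuming plain comma-separated fields with no quoted commas and the same column positions (the file names are placeholders, as above):

import csv

# Keep only the first row seen for each (ip, cik) pair, reading one line at a time.
seen = set()
with open("file.csv", newline="") as src, open("file_uniq.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        key = (row[0], row[4])  # 0-based: column 1 = ip, column 5 = cik
        if key not in seen:
            seen.add(key)
            writer.writerow(row)

Like the awk one-liner, this keeps only the set of seen (ip, cik) keys in memory, never the whole file.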

perl

Here is an example using pandas and reduce:

from functools import reduce

import pandas as pd

df = reduce(
    lambda df_i, df_j: pd.concat([df_i, df_j])
                         .drop_duplicates(subset=["cik", "ip"]),
    pd.read_csv("path/to/csv", chunksize=100000)
)
df.to_csv("path/to/deduplicated/csv")

This avoids opening the entire file at once (it reads 100,000-line chunks instead) and drops duplicates as it goes.
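
If even the deduplicated result is too large to hold in memory, a variation on the same chunked idea is to append each deduplicated chunk to the output file and keep only the set of (cik, ip) keys seen so far. A rough sketch, using the same placeholder paths:

import pandas as pd

seen = set()
first_chunk = True
for chunk in pd.read_csv("path/to/csv", chunksize=100000):
    # Drop duplicates inside the chunk, then drop keys already written earlier.
    chunk = chunk.drop_duplicates(subset=["cik", "ip"])
    keys = list(zip(chunk["cik"], chunk["ip"]))
    mask = [k not in seen for k in keys]
    seen.update(keys)
    # Append to the output; write the header only for the first chunk.
    chunk[mask].to_csv(
        "path/to/deduplicated/csv",
        mode="w" if first_chunk else "a",
        header=first_chunk,
        index=False,
    )
    first_chunk = False

With this approach only the (cik, ip) keys stay in memory, not the accumulated rows.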

PMende
  • I think this is a good solution, will try it out, thanks @PMende – Stefan May 03 '19 at 21:32
  • I get this error when trying to run it: "sys:1: DtypeWarning: Columns (14) have mixed types. Specify dtype option on import or set low_memory=False." – Stefan May 03 '19 at 21:35
  • @Stefan If you know the types of all of your columns, you can specify them in the call to `pd.read_csv` (see docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) by passing a `dict` to the `dtype` parameter. For example, you can say something like: `dtypes = {"ip": "str", "date": "str", }`. Then in the `read_csv` call, you just add the parameter `dtype=dtypes`. Generally you can ignore the `DtypeWarning`, and pandas does a good job of inferring what types to use. – PMende May 03 '19 at 22:14
  • Won't this cause an OOM error if the deduplicated file is too big to fit in memory? – user3810965 Oct 07 '22 at 14:07

You can do the following:

import pandas as pd

# Note: this reads the whole file into memory, so it needs enough RAM for all the rows.
df = pd.read_csv('filepath/filename.csv', sep='your separator')

# Keep the first occurrence of each (cik, ip) pair and drop the rest.
df.drop_duplicates(subset=['cik', 'ip'], keep='first', inplace=True)

df.to_csv('filepath/new_filename.csv', sep='your separator', header=True, index=False)

and enjoy your CSV without the duplicates.

T. Novais