
I have a large CSV of around 24 million rows, and I want to cut it down in size.

Here is a little preview of the CSV:

[screenshot: CSV preview]

I want to remove the rows that have the same CIK and IP. I have a bunch of these files and they take up a lot of space, so I want an efficient way to remove the duplicates.

I've made a test script to check how many duplicates of CIK there are, and for some there are more than 100k; that is why I want to cut those duplicates out.

I've tried a few approaches, but in most cases they failed because of the size of the CSV.

Stefan
  • What do you mean by *it failed*? Does the program crash or are you getting some kind of error? – Nazim Kerimbekov May 03 '19 at 21:13
  • How many duplicates do you have per `CIK`/`IP` combination? – PMende May 03 '19 at 21:14
  • The program doesn't work. I got errors, and after I fixed them the program just loops forever; that's why I posted, so I can get some suggestions on how to do it – Stefan May 03 '19 at 21:14
  • I've made a test script to count how many rows have the same CIK, and for some there are more than 100k; that is the reason to remove some of them – Stefan May 03 '19 at 21:15
  • Are you successfully able to read it into pandas and then the drop_duplicates operation breaks, or can't you successfully create a df? – Ben Pap May 03 '19 at 21:16
  • Can you post what you tried already, to see why it loops forever? – solarc May 03 '19 at 21:20
  • I didn't use pandas, but I will try it, because it seems to be the only suggested solution – Stefan May 03 '19 at 21:21

3 Answers


Another quick way is to do it with awk, running from the command line:

awk -F, '!x[$1,$5]++' file.csv > file_uniq.csv

where `file.csv` is the name of your file and `file_uniq.csv` is where you want your deduplicated records (`$1` and `$5` are column numbers: 1 for `ip` and 5 for `cik`).

P.S. You should already have awk if you're on Linux/Mac, but you may need to download it separately on Windows.
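
The expression `!x[$1,$5]++` prints a line only the first time its (column 1, column 5) pair is seen. For comparison, here is a minimal streaming sketch of the same idea in Python, assuming plain comma-separated fields with no quoted commas and the same column positions (the file names are placeholders, as above):

import csv

# Keep only the first row seen for each (ip, cik) pair, reading one line at a time.
seen = set()
with open("file.csv", newline="") as src, open("file_uniq.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        key = (row[0], row[4])  # 0-based: column 1 = ip, column 5 = cik
        if key not in seen:
            seen.add(key)
            writer.writerow(row)

Like the awk one-liner, this keeps only the set of seen (ip, cik) keys in memory, never the whole file.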

perl

Here is an example using pandas and reduce:

from functools import reduce

import pandas as pd

df = reduce(
    lambda df_i, df_j: pd.concat([df_i, df_j])
                         .drop_duplicates(subset=["cik", "ip"]),
    pd.read_csv("path/to/csv", chunksize=100000)
)
df.to_csv("path/to/deduplicated/csv")

This avoids opening the entire file at once (it reads 100,000-line chunks instead) and drops duplicates as it goes.
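
If even the deduplicated result is too large to hold in memory, a variation on the same chunked idea is to append each deduplicated chunk to the output file and keep only the set of (cik, ip) keys seen so far. A rough sketch, using the same placeholder paths:

import pandas as pd

seen = set()
first_chunk = True
for chunk in pd.read_csv("path/to/csv", chunksize=100000):
    # Drop duplicates inside the chunk, then drop keys already written earlier.
    chunk = chunk.drop_duplicates(subset=["cik", "ip"])
    keys = list(zip(chunk["cik"], chunk["ip"]))
    mask = [k not in seen for k in keys]
    seen.update(keys)
    # Append to the output; write the header only for the first chunk.
    chunk[mask].to_csv(
        "path/to/deduplicated/csv",
        mode="w" if first_chunk else "a",
        header=first_chunk,
        index=False,
    )
    first_chunk = False

With this approach only the (cik, ip) keys stay in memory, not the accumulated rows.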

PMende
  • I think this is a good solution, will try it out, thanks @PMende – Stefan May 03 '19 at 21:32
  • I get this error when trying to run it: "sys:1: DtypeWarning: Columns (14) have mixed types. Specify dtype option on import or set low_memory=False." – Stefan May 03 '19 at 21:35
  • @Stefan If you know the types of all of your columns, you can specify them in the call to `pd.read_csv` (see docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) by passing a `dict` to the `dtype` parameter. For example, you can say something like: `dtypes = {"ip": "str", "date": "str", }`. Then in the `read_csv` call, you just add the parameter `dtype=dtypes`. Generally you can ignore the `DtypeWarning`, and pandas does a good job of inferring what types to use. – PMende May 03 '19 at 22:14
  • Won't this cause an OOM error if the deduplicated file is too big to fit in memory? – user3810965 Oct 07 '22 at 14:07

You can do the following:

import pandas as pd

# Note: this reads the whole file into memory, so it needs enough RAM for all the rows.
df = pd.read_csv('filepath/filename.csv', sep='your separator')

# Keep the first occurrence of each (cik, ip) pair and drop the rest.
df.drop_duplicates(subset=['cik', 'ip'], keep='first', inplace=True)

df.to_csv('filepath/new_filename.csv', sep='your separator', header=True, index=False)

and enjoy your CSV without the duplicates.

T. Novais