I can't read a whole 5 GB CSV file into memory in one go, but using Pandas' read_csv()
with chunksize set seems like a fast and easy way to process it:
import pandas as pd

def run_pand(csv_db):
    # read the file 5000 rows at a time instead of all at once
    reader = pd.read_csv(csv_db, chunksize=5000)
    for chunk in reader:
        dup = chunk.duplicated(subset=["Region", "Country", "Ship Date"])
        # and afterwards I will write the duplicates to a new csv
As I understand it, reading in chunks means I won't find a duplicate if the two matching rows end up in different chunks. Or will it still work?
Is there a way to find these duplicates across the whole file with a Pandas method?
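
Just to show what I mean, here is the rough idea I had (only a sketch, nothing tested): keep just the three key columns from every chunk and check for duplicates in that much smaller frame at the end. The file name sales.csv is only a placeholder, and I'm assuming the key columns on their own fit in memory.

import pandas as pd

keys = []
for chunk in pd.read_csv("sales.csv", chunksize=5000):  # "sales.csv" is just a placeholder name
    # keep only the columns that define a duplicate
    keys.append(chunk[["Region", "Country", "Ship Date"]])

all_keys = pd.concat(keys, ignore_index=True)
# keep=False marks every row that has a match anywhere in the file
dup_mask = all_keys.duplicated(keep=False)
print(all_keys[dup_mask])

Would something like this work, or is there a cleaner way to do it with Pandas?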