
I have a CSV file containing millions of rows. I want to convert it to a SQL-style database using Python's sqlite3 library. The table is required to have a certain column as its primary key, but that column contains some duplicate values, and I need to drop the rows whose key is a duplicate. Right now I build a set of keys and do a lookup for every row, which takes O(n) time overall and O(n) extra space. Is there a more efficient way, in terms of time and space complexity, to find the duplicate keys among millions of entries?
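For reference, a minimal sketch of the set-based approach described above, assuming a two-column CSV whose first column is the intended primary key (the file, table, and column names are placeholders):

```python
import csv
import sqlite3

# File, table, and column names below are placeholders.
conn = sqlite3.connect("data.db")
conn.execute("CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, value TEXT)")

seen = set()  # every key inserted so far -> O(n) extra space
with open("data.csv", newline="") as f:
    for row in csv.reader(f):
        key = row[0]    # assuming the first column is the intended primary key
        if key in seen: # a set membership test is O(1) on average
            continue    # skip rows whose key has already been inserted
        seen.add(key)
        conn.execute("INSERT INTO records (id, value) VALUES (?, ?)", (key, row[1]))

conn.commit()
conn.close()
```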

RioAraki
  • Related: [Python CSV to SQLite](https://stackoverflow.com/q/5942402/190597) and [Deleting duplicate rows from sqlite database](https://stackoverflow.com/q/8190541/190597). – unutbu Jan 26 '19 at 03:50
  • *"Right now I built a set and do look up everytime"*: [Edit] your Question and show how you doing that. – stovfl Jan 26 '19 at 07:40
  • Er, just use `INSERT OR IGNORE` when adding rows? – Shawn Jan 26 '19 at 08:05
  • Though I'd just create the table and use the sqlite3 shell's [CSV import](https://www.sqlite.org/cli.html#csv_import) functionality. It'll spam you with warnings about duplicate keys but skip those rows. No need for python unless you're doing some sort of preprocessing of the data and not a straight import. – Shawn Jan 26 '19 at 08:08
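A minimal sketch of the `INSERT OR IGNORE` approach suggested in the comments, assuming the same two-column CSV with the first column as the intended primary key (file, table, and column names are placeholders). The PRIMARY KEY constraint makes SQLite reject duplicate keys itself, so no Python-side set is required:

```python
import csv
import sqlite3

# Same placeholder names as above; the PRIMARY KEY does the duplicate detection.
conn = sqlite3.connect("data.db")
conn.execute("CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, value TEXT)")

with open("data.csv", newline="") as f:
    rows = ((row[0], row[1]) for row in csv.reader(f))
    # OR IGNORE silently skips any row whose primary key already exists,
    # so duplicate detection happens in SQLite's index rather than in Python memory.
    conn.executemany("INSERT OR IGNORE INTO records (id, value) VALUES (?, ?)", rows)

conn.commit()
conn.close()
```

Alternatively, as the last comment notes, the sqlite3 command-line shell's `.import` can load the CSV straight into a pre-created table, warning about duplicate keys but skipping those rows, so Python is only needed if the data requires preprocessing.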

0 Answers