Low resources usage when using dedupe python

Question

I need to find duplicates in a large dataset, so I'm testing the dedupe Python library.

I know it is recommended for small datasets, so I thought a powerful machine could improve performance. I have a machine with 56 GB of RAM and I'm running a test similar to "csv_example" on a dataset with 200,000 rows. It works, but memory usage is very low, and so is CPU usage.

It seems to take too long in the blocking stage:

INFO:dedupe.blocking:10000, 110.6458142 seconds
INFO:dedupe.blocking:20000, 300.6112282 seconds
INFO:dedupe.blocking:30000, 557.1010122 seconds
INFO:dedupe.blocking:40000, 915.3087222 seconds
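As a rough check, the cumulative timings above can be turned into per-batch durations. This is a minimal sketch that assumes the log format shown; the helper name `batch_durations` is made up for illustration.

```python
import re

# Log lines copied from the output above: cumulative record count, cumulative seconds.
LOG = """\
INFO:dedupe.blocking:10000, 110.6458142 seconds
INFO:dedupe.blocking:20000, 300.6112282 seconds
INFO:dedupe.blocking:30000, 557.1010122 seconds
INFO:dedupe.blocking:40000, 915.3087222 seconds
"""

PATTERN = re.compile(r"INFO:dedupe\.blocking:(\d+), ([\d.]+) seconds")


def batch_durations(log_text):
    """Return (records_so_far, seconds_spent_on_this_batch) for each log line."""
    durations = []
    previous = 0.0
    for match in PATTERN.finditer(log_text):
        count, elapsed = int(match.group(1)), float(match.group(2))
        durations.append((count, elapsed - previous))
        previous = elapsed
    return durations


for count, seconds in batch_durations(LOG):
    print(f"up to {count}: {seconds:.1f} s for this batch")
```

Each batch of 10,000 records takes longer than the one before it (roughly 111, 190, 256 and 358 seconds), so the blocking stage slows down as it progresses rather than running at a constant rate.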

Could anyone help me improve resource usage, or tell me whether there is a library or setting that makes the program use more of the available resources?

score 3 · Accepted Answer · answered Jun 12 '17 at 00:52

What version of dedupe are you running? As of 1.6.8, it should handle a record set of this size pretty easily.

However, the general guidance is that when you run into memory problems, you should switch to doing the blocking in a database, as in the postgres example.

(I'm a main author of dedupe).
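To illustrate the idea of doing the blocking in a database, here is a minimal sketch. It is not the code from the postgres example: SQLite (from the standard library) stands in for PostgreSQL, the table and function names are made up, and the block keys are hypothetical. In practice the (block_key, record_id) pairs would come from dedupe's blocking step.

```python
import sqlite3


def candidate_pairs(block_pairs):
    """Store (block_key, record_id) pairs in a table, then self-join to find
    pairs of records that share at least one block key."""
    conn = sqlite3.connect(":memory:")  # stand-in for a real database server
    conn.execute("CREATE TABLE blocking_map (block_key TEXT, record_id INTEGER)")
    conn.executemany("INSERT INTO blocking_map VALUES (?, ?)", block_pairs)
    # An index on block_key lets the database do the join without
    # holding every block in Python memory.
    conn.execute("CREATE INDEX idx_block_key ON blocking_map (block_key)")
    rows = conn.execute(
        """
        SELECT DISTINCT a.record_id, b.record_id
        FROM blocking_map AS a
        JOIN blocking_map AS b
          ON a.block_key = b.block_key AND a.record_id < b.record_id
        ORDER BY a.record_id, b.record_id
        """
    ).fetchall()
    conn.close()
    return rows


# Hypothetical block keys for four records.
example = [
    ("smith:chi", 1), ("smith:chi", 2),
    ("606", 1), ("606", 2), ("606", 3),
    ("jones:nyc", 4),
]
print(candidate_pairs(example))  # [(1, 2), (1, 3), (2, 3)]
```

Only records that share a block key are ever paired, so record 4 is never compared with anything, and the grouping work is pushed to the database instead of being held in the Python process.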

fgregg

Thanks! dedupe is a great library. Is it possible to use this library in Spark to dedupe 3,000,000 records? :) – mjimcua Jun 16 '17 at 17:51

I don't know anything about Spark, but dedupe should be able to handle 3 million records pretty easily. – fgregg Jun 17 '17 at 00:14
It takes over 20 minutes to run the Postgres example you mentioned above. Am I doing something wrong, or is this to be expected? It would be really great to speed this up. – Crimson_Hawk Jul 15 '21 at 15:24

score 0 · Answer 2 · answered Aug 10 '22 at 12:14

I have successfully used the dedupe Python library to deduplicate 16 million records. However, 16M records will not fit into memory, so I deduped X records at a time (via postgres); i.e., at 10,000 records, the worst-case scenario is maxing out the 64 GB of RAM on our EC2 instance.

https://ronbeltran.pages.dev/2022/08/using-python-dedupe-library-millions-records/
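Reading records a fixed number at a time can be sketched as follows. This is not the approach from the linked post, only an illustration under stated assumptions: SQLite stands in for postgres, and the `records` table and `iter_chunks` helper are hypothetical.

```python
import sqlite3


def iter_chunks(conn, query, chunk_size):
    """Yield lists of at most chunk_size rows, so that only one chunk
    needs to be held in memory at a time."""
    cursor = conn.execute(query)
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        yield rows


# Hypothetical table with 25 records, read 10 at a time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO records (name) VALUES (?)",
    [(f"name {i}",) for i in range(25)],
)
sizes = [
    len(chunk)
    for chunk in iter_chunks(conn, "SELECT id, name FROM records ORDER BY id", 10)
]
print(sizes)  # [10, 10, 5]
```

In this sketch, records in different chunks are never compared with each other, so duplicates that fall into separate chunks would be missed unless the chunks are formed so that likely duplicates land together.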

Ronnie Beltran

Doesn't the dedupe quality go down when the items are processed in chunks? – flash May 27 '23 at 19:50
