
I am using the postgres dedupe example code. For 10,000 rows it takes 163 seconds, and I found that most of that time is spent in this part:

import collections

full_data = []
cluster_membership = collections.defaultdict(lambda: 'x')
for cluster_id, (cluster, score) in enumerate(clustered_dupes):
    for record_id in cluster:
        # Scan every row of `data` to find the one whose first column
        # matches this record id, then prepend the cluster id to it.
        for row in data:
            if record_id == int(row[0]):
                row = list(row)
                row.insert(0, cluster_id)
                row = tuple(row)
                full_data.append(row)

Is there any possible optimization for this part that produces the same result with lower time complexity? Will this script work for 150 million records?
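
One rewrite I am considering (only a sketch, assuming the record ids coming out of `clustered_dupes` are integers and that each row's first column holds the record id, as in the loop above) is to index `data` once by record id and then look up each cluster member directly instead of rescanning all rows:

# Index the rows once, keyed by record id.
rows_by_id = {int(row[0]): row for row in data}

full_data = []
for cluster_id, (cluster, score) in enumerate(clustered_dupes):
    for record_id in cluster:
        row = rows_by_id.get(record_id)
        if row is not None:
            # Prepend the cluster id, keeping the rest of the row as-is.
            full_data.append((cluster_id,) + tuple(row))

This replaces the inner scan of `data` with a dictionary lookup, so the work grows with the number of rows plus the number of cluster members rather than their product.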

  • Consider doing the necessary deduplication with SQL rather than in the Python code. [This post](https://stackoverflow.com/questions/2230295/whats-the-best-way-to-dedupe-a-table) might be able to help you get started – Daniel Corin Aug 01 '17 at 03:40
  • Where does `data` come from? Also, `cluster_membership` seems to be unused in this snippet – Azat Ibrakov Aug 01 '17 at 03:46
  • @danielcorin I have gone through the link. Every solution there removes duplicates, but I want to find clusters of duplicate records in a postgres table instead of removing them. – Shubham Singh Aug 01 '17 at 05:03
  • @AzatIbrakov `data` stores the result of a select query on the postgres table. I have removed that `cluster_membership` statement, but the time taken is still 163 seconds, the same as earlier. – Shubham Singh Aug 01 '17 at 05:16
  • Please provide an example of the input and the desired output – Azat Ibrakov Aug 01 '17 at 05:17

0 Answers