
I am using the postgres dedupe example code. For 10,000 rows it takes 163 seconds, and I found that most of that time is spent in this part:

import collections

full_data = []
cluster_membership = collections.defaultdict(lambda: 'x')
for cluster_id, (cluster, score) in enumerate(clustered_dupes):
    for record_id in cluster:
        # Scan every row of `data` to find the one whose first column
        # matches this record id, then prepend the cluster id to it.
        for row in data:
            if record_id == int(row[0]):
                row = list(row)
                row.insert(0, cluster_id)
                row = tuple(row)
                full_data.append(row)

Is there any possible optimization for this part that produces the same result with lower time complexity? Will this script work for 150 million records?
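
One rewrite I am considering (only a sketch, assuming the record ids coming out of `clustered_dupes` are integers and that each row's first column holds the record id, as in the loop above) is to index `data` once by record id and then look up each cluster member directly instead of rescanning all rows:

# Index the rows once, keyed by record id.
rows_by_id = {int(row[0]): row for row in data}

full_data = []
for cluster_id, (cluster, score) in enumerate(clustered_dupes):
    for record_id in cluster:
        row = rows_by_id.get(record_id)
        if row is not None:
            # Prepend the cluster id, keeping the rest of the row as-is.
            full_data.append((cluster_id,) + tuple(row))

This replaces the inner scan of `data` with a dictionary lookup, so the work grows with the number of rows plus the number of cluster members rather than their product.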

  • Consider doing the necessary deduplication with SQL rather than in the Python code. [This post](https://stackoverflow.com/questions/2230295/whats-the-best-way-to-dedupe-a-table) might be able to help you get started – Daniel Corin Aug 01 '17 at 03:40
  • Where does `data` come from? Also, `cluster_membership` seems to be unused in this snippet – Azat Ibrakov Aug 01 '17 at 03:46
  • @danielcorin I have gone through the link. Every solution there removes duplicates, but I want to find clusters of duplicate records in a postgres table instead of removing them. – Shubham Singh Aug 01 '17 at 05:03
  • @AzatIbrakov `data` stores the result of a select query on the postgres table. I have removed that `cluster_membership` statement, but the time taken is still 163 seconds, the same as earlier. – Shubham Singh Aug 01 '17 at 05:16
  • Please provide an example of the input and the desired output – Azat Ibrakov Aug 01 '17 at 05:17

0 Answers