
I want to find duplicates in a large JSON file (12 GB) in Python. Currently I read the file line by line, use a `set()` as a lookup table to store the unique values, and write the duplicates to a separate file.

But if the input data is 100 GB, the `set()` will not be able to hold all the unique values in memory, so my code is not scalable.

Any idea how I can do this differently?

Using Python:

import json
import time

""" This class contains the business logic for identifying the duplicates and creating an output file for further processing """

class BusinessService:

    """ The method identifies the duplicates """

    @staticmethod
    def service(ipPath, opPath):
        start_time = time.time()    # Start the timer
        uniqueHandleSet = set()     # Set used as the lookup table for the unique handles

        try:
            # Open the output file that will collect the duplicate handles,
            # and read the JSON file line by line with a ~200 MB buffer, as it is too big to read at once
            with open(opPath, 'w+', encoding='utf-8') as duplicateHandles, \
                 open(ipPath, buffering=200000000, encoding='utf-8') as infile:
                for line in infile:
                    tweetJsonObject = json.loads(line)
                    if tweetJsonObject["name"] not in uniqueHandleSet:
                        uniqueHandleSet.add(tweetJsonObject["name"])
                    else:
                        duplicateHandles.write(line)
            print("--- %s seconds --- memory 200mb while buffering" % (time.time() - start_time))  # Print the total execution time
        except Exception as e:
            print("Error:", e)

I need an alternative to `set()` for the lookup.

  • Why is `set()` not handling your situation? Are you running out of memory from the size of the set object? – James Dec 23 '18 at 02:42
  • I don't have enough reputation to comment, but look at this [answer](https://stackoverflow.com/questions/44191465/efficiently-identify-duplicates-in-large-list-500-000) – wooohooo Dec 23 '18 at 03:45
  • Have you looked into using something like map-reduce from PySpark? – user124384 Dec 23 '18 at 03:50
  • Hey James, the problem with `set()` is that it won't be able to hold all the data in memory if the input file is 100 GB+. I did not venture into PySpark because I'm required to use native Python libraries, and PySpark would require my laptop to have a Spark/Hadoop environment as well, right? – Mohit Ruke Dec 23 '18 at 05:02
  • Are you willing and able to work with a database? You will almost certainly need one for 100+ GB of data. Moreover, you will almost certainly need to batch-process the DB data back to JSON. If this is in your wheelhouse, someone can help. – Joseph8th Dec 23 '18 at 07:36
  • Actually it's a challenge that I'm participating in. My code works efficiently on a 12 GB file since the `set()` can handle it, but it needs to scale to a 100 GB file, where the `set()` won't work as a lookup. Would you recommend using a database, or something like PySpark where I can use an RDD for the lookup? – Mohit Ruke Dec 23 '18 at 20:06
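
For reference, here is a rough, untested sketch of the database route suggested in the comments, using only the standard-library `sqlite3` module as an on-disk lookup in place of `set()`. The `"name"` field follows the question's code; the function name, table name and `dbPath` default are made up for illustration:

import json
import sqlite3

def find_duplicates(ipPath, opPath, dbPath='handles.db'):
    # A SQLite table on disk plays the role of the in-memory set():
    # the PRIMARY KEY constraint rejects any handle that has already been seen.
    con = sqlite3.connect(dbPath)
    con.execute("CREATE TABLE IF NOT EXISTS handles (name TEXT PRIMARY KEY)")

    with open(opPath, 'w', encoding='utf-8') as duplicateHandles, \
         open(ipPath, encoding='utf-8') as infile:
        for line in infile:
            name = json.loads(line)["name"]
            try:
                con.execute("INSERT INTO handles (name) VALUES (?)", (name,))
            except sqlite3.IntegrityError:
                # The handle is already in the table, so this line is a duplicate
                duplicateHandles.write(line)

    con.commit()
    con.close()

Memory usage stays flat no matter how large the input is, since the lookup lives on disk, and nothing outside the standard library is needed (no Spark/Hadoop setup). Whether it is fast enough for 100 GB would need to be measured.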
