I want to find duplicates in a large JSON file (12 GB) in Python. Currently I read through the whole file, use a set() as a lookup table to store the unique values, and then write the duplicates to a separate file.
But if the input data is 100 GB, the set() will not be able to hold all the unique values in memory, so my code is not scalable.
Any idea how I can do this differently?
Using Python:
import json
import time


class BusinessService:
    """This class contains the business logic for identifying the duplicates
    and creating an output file for further processing."""

    @staticmethod
    def service(ipPath, opPath):
        """This method identifies the duplicates."""
        start_time = time.time()  # Start the timer
        uniqueHandleSet = set()  # Set used as the in-memory lookup table for unique handles
        try:
            # Open an output file to collect the duplicate handles, and read the
            # input JSON file line by line with a ~200 MB buffer, as it is too
            # big to read at once.
            with open(opPath, 'w', encoding='utf-8') as duplicateHandles, \
                 open(ipPath, buffering=200000000, encoding='utf-8') as infile:
                for line in infile:
                    tweetJsonObject = json.loads(line)
                    if tweetJsonObject["name"] not in uniqueHandleSet:
                        uniqueHandleSet.add(tweetJsonObject["name"])
                    else:
                        duplicateHandles.write(line)
            # Print the total time required to execute
            print("--- %s seconds --- with a ~200 MB read buffer" % (time.time() - start_time))
        except Exception as e:
            print("Error:", e)
I need an alternative to set() for the lookup.
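One direction I was considering is to partition the input into smaller bucket files by a hash of the "name" field, so that each bucket's unique names fit in memory, and then run the same set() check per bucket. Below is a rough sketch of what I mean (the function name dedupe_by_partitioning, the bucket count of 64 and the "buckets" temp directory are just placeholders I made up, and it assumes the input is newline-delimited JSON with a "name" field, like in my code above); note that the duplicates would come out grouped by bucket rather than in the original input order:

import hashlib
import json
import os


def dedupe_by_partitioning(ipPath, opPath, numBuckets=64, tmpDir="buckets"):
    # Phase 1: split the input into numBuckets files, grouping lines by a hash
    # of "name" so that all occurrences of the same name land in the same bucket.
    os.makedirs(tmpDir, exist_ok=True)
    buckets = [open(os.path.join(tmpDir, "bucket_%d.json" % i), "w", encoding="utf-8")
               for i in range(numBuckets)]
    with open(ipPath, encoding="utf-8") as infile:
        for line in infile:
            name = json.loads(line)["name"]
            idx = int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16) % numBuckets
            buckets[idx].write(line)
    for b in buckets:
        b.close()

    # Phase 2: each bucket should be small enough for an in-memory set,
    # so reuse the original set() lookup per bucket.
    with open(opPath, "w", encoding="utf-8") as duplicateHandles:
        for i in range(numBuckets):
            uniqueHandleSet = set()
            with open(os.path.join(tmpDir, "bucket_%d.json" % i), encoding="utf-8") as bucketFile:
                for line in bucketFile:
                    name = json.loads(line)["name"]
                    if name not in uniqueHandleSet:
                        uniqueHandleSet.add(name)
                    else:
                        duplicateHandles.write(line)

Is hash-partitioning like this a reasonable way to go, or would an on-disk key-value store (e.g. the sqlite3 or dbm modules) be the better alternative to set() here?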