I have a JSON file (one JSON object per line) that I want to remove duplicate rows from, but it's too large to fit into memory. I found a way to get it done, but my guess is that it's not the best way.
My problem is that it runs in 8 minutes on a 12 GB dataset, and the requirement is to scale the code so that it can handle a 100 GB dataset. Any pointers on how to do this? Should I use multithreading or multiprocessing in Python, or some other method? I've also put a rough sketch of one idea I'm considering after the code.
This is the code:
import json
import time
""" This class contains the business logic for identifying the duplicates and creating an output file for further processing """
class BusinessService:
""" The method identiifes the duplicate """
def service(ipPath,opPath):
start_time = time.time() #We start the timer to see how much time the method takes to work #
uniqueHandleSet = set(); #Creating a set to store unique values #
try:
duplicateHandles = open(opPath,'w+',encoding='utf-8') #Opening and creating an output file to catch the duplicate hanndles #
with open(ipPath,buffering = 200000000,encoding = 'utf-8') as infile: #Reading the JSON File by buffering and using 20mb as it is too big to read at once #
for line in infile:
tweetJsonObject = json.loads(line);
if tweetJsonObject["name"] not in uniqueHandleSet:
uniqueHandleSet.add(tweetJsonObject["name"]);
else:
duplicateHandles.write(line);
print("--- %s seconds --- memory 200mb while buffering" % (time.time() - start_time)); #Printing the total time required to execute
except:
print("Error")
finally:
duplicateHandles.close();
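
For the scaling part, here is a rough sketch of the partition-then-dedupe idea I'm considering: split the input by a hash of the "name" field into smaller partitions so that each partition's set of names fits in memory, then check each partition independently with multiprocessing.Pool. The partition count, paths, and helper names below are placeholders I made up, not tested code:

import json
import os
import zlib
from multiprocessing import Pool

NUM_PARTITIONS = 64  # placeholder: tune so each partition's name set fits in RAM

def partition_input(ipPath, tmpDir):
    """Split the input into NUM_PARTITIONS files, keyed on a hash of 'name'."""
    os.makedirs(tmpDir, exist_ok=True)
    outputs = [open(os.path.join(tmpDir, "part_%d.json" % i), 'w', encoding='utf-8')
               for i in range(NUM_PARTITIONS)]
    try:
        with open(ipPath, encoding='utf-8') as infile:
            for line in infile:
                name = json.loads(line)["name"]
                bucket = zlib.crc32(name.encode('utf-8')) % NUM_PARTITIONS
                outputs[bucket].write(line)  # all lines with the same name land in the same partition
    finally:
        for f in outputs:
            f.close()

def duplicatesInPartition(partPath):
    """Return the duplicate lines within one partition; its name set fits in memory."""
    seen = set()
    duplicates = []
    with open(partPath, encoding='utf-8') as infile:
        for line in infile:
            name = json.loads(line)["name"]
            if name in seen:
                duplicates.append(line)
            else:
                seen.add(name)
    return duplicates

def service(ipPath, opPath, tmpDir="partitions"):
    partition_input(ipPath, tmpDir)
    partPaths = [os.path.join(tmpDir, "part_%d.json" % i) for i in range(NUM_PARTITIONS)]
    with Pool() as pool, open(opPath, 'w', encoding='utf-8') as duplicateHandles:
        for dupLines in pool.map(duplicatesInPartition, partPaths):
            duplicateHandles.writelines(dupLines)

Would something along these lines be the right direction, or does the extra partitioning pass just add as much I/O as it saves?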