-1

I need to code this task in java. I have 2 large files around 5GB each containing text data of multiple rows. Each row is a line of comma separated fields, for example "name,empId,designation,address,...,so on up to 30 fields". I need to read these 2 files and write the records to another file with additional field which specifies the given data row is Changed, Not Changed, Added or Deleted. For example

File1

Tom,E100,Engineer

Rick,E200,Engineer

File2

Tom,E100,Manager

Paul,E300,Clerk

ResultFile

Tom,E100,Manager,Changed

Paul,E300,Clerk,Added

Rick,E200,Engineer,Deleted

Approach I used is to create a map from the data of file1 using empId as the key and entire data row as value (assuming empId is unique) and then read each record from file2 to check against the data in the map (I am not reading entire content of file2 into memory, but only file1 to create the map). I am using BufferedReader/BufferedWriter for reading and writing.

This approach works fine but only for small data file. Given data files that runs into GBs my program runs out of memory very soon while trying to create the map.

What would be the right approach to achieve this task both in terms of memory and speed of execution?

Thanks, LX

  • 1
    Could you get the files ordered by **empId**? Than you would not need to store any of the files in memory. (So maybe sort them forst by **empId**). – MrSmith42 Jul 18 '16 at 15:37
  • 1
    Related: http://stackoverflow.com/q/30653705/572670 – amit Jul 18 '16 at 15:41

1 Answers1

1

A different approach could be to do an external sort on each file based on the key, and then iterate them in parallel.

High level pseudo code:

sort(file1)
sort(file2)
iter1 = file1.begin()
iter2 = file2.begin()
while (iter1 != file1.end() && iter2 != file2.end()):
  element1 = iter1.getElement()
  element2 = iter2.getElement()
  if element1.key() == element2.key():
     // same element, check if changed
     iter1 = iter1.next()
     iter2 = iter2.next()
  else if element1.key() < element2.key()
     // element1 is not in file2, so it is removed.
     iter1 = iter1.next()
  else 
     // element2 is in file2 but not in file1, so it's added
     iter2 = iter2.next()

while (iter1 != list1.end()):
  element1 = iter1.getElement()
  // element1 is removed 
  iter1 = iter1.next()

while (iter2 != list2.end()):
  element2 = iter2.getElement()
  // element2 is added
  iter2 = iter2.next()

This requires sorting, which can be done with little memory signature when doing external sort, and the next loops also use constant amount of memory. Complexity is O(mlogm + nlogn), where n,m being the lists sizes

amit
  • 175,853
  • 27
  • 231
  • 333