
I'm trying to read a large (multiple GBs) file of JSON lines, do some 'processing', and write the result to another file. I'll be using the GSON streaming API for the purpose. To speed things up, I'd like to multithread the 'processing' part. I'm reading the file line by line since I can't load the whole file in memory. My 'processing' depends on two different lines (possibly thousands of lines apart) that meet certain conditions. Is it possible to multithread this 'processing' without loading the whole thing in memory?

Neeraj
  • If you are only reading from a single file and writing to a single file, multiple threads won't speed up the IO, which is probably the bottleneck (unless your processing is intense --- what does it do?) – Thilo Oct 08 '19 at 10:12
  • Take a look at: [how to parse a huge JSON file without loading it in memory](https://stackoverflow.com/questions/54817985/how-to-parse-a-huge-json-file-without-loading-it-in-memory/54818259#54818259), [Parse only one field in a large JSON string](https://stackoverflow.com/questions/54852415/parse-only-one-field-in-a-large-json-string/54854433#54854433). If you already use `Spring` you can try to use [Batch processing](https://spring.io/guides/gs/batch-processing/) which is created for problems like yours. It introduces: readers, writers, processors which should fit to your problem. – Michał Ziober Oct 08 '19 at 22:41

2 Answers


> Any suggestions on how to go about this?

Well, a high-level design would be to have a reader thread, a writer thread, and an ExecutorService instance to do the processing.

  • The reader thread reads the JSON file using a streaming API1. When it has identified a unit of work to be performed, it creates a task and submits it to the executor service, and repeats.

  • The executor service processes the tasks it is given. You should use a service with a bounded thread pool, and possibly a bounded / blocking work queue.

  • The writer thread scans the Future objects created by task submission, and uses them to get the task results (in order), generate the output from the results and write the output to the file.

If the output file doesn't need to be in order, you could dispense with the writer thread2 and have the tasks write to the file. They will need to use a shared lock or mutex so that only one task is writing to the file at a time.

1 - If you don't use a streaming API, then: 1) you need to be able to parse and hold the entire input file in memory, and 2) the reader thread won't be able to start submitting tasks until it has finished parsing the input.

2 - Do this if it simplifies things, not for performance reasons. The need for mutual exclusion while writing kills any hypothetical performance benefits.


As @Thilo notes, there is little to be gained by trying to have multiple reader threads. (And a whole lot of complexity if you try!)
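The reader / pool / writer design above could be sketched roughly as follows. This is a minimal, self-contained illustration, not a drop-in implementation: `process` is a placeholder for your real per-unit work, and the in-memory list and result list stand in for the GSON-streamed input file and the output file. The bounded queue plus `CallerRunsPolicy` is one way to get back-pressure so the reader can't flood memory with queued tasks.

```java
import java.util.*;
import java.util.concurrent.*;

public class PipelineSketch {

    // Placeholder for the real 'processing' of one unit of work.
    static String process(String line) {
        return line.toUpperCase();
    }

    public static List<String> run(List<String> inputLines) throws Exception {
        // Bounded pool + bounded work queue; CallerRunsPolicy makes the
        // submitter run the task itself when the queue is full, which
        // throttles the reader instead of buffering the whole file.
        ExecutorService pool = new ThreadPoolExecutor(
                4, 4, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(100),
                new ThreadPoolExecutor.CallerRunsPolicy());

        // Futures in submission order, so the writer emits results in order.
        BlockingQueue<Future<String>> pending = new LinkedBlockingQueue<>();

        List<String> output = new ArrayList<>();   // stand-in for the output file

        // Writer thread: drain futures in order and "write" each result.
        Thread writer = new Thread(() -> {
            try {
                while (true) {
                    String result = pending.take().get();
                    if (result == null) break;     // poison pill: end of input
                    output.add(result);
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
        writer.start();

        // Reader (here, the current thread): one task per unit of work.
        for (String line : inputLines) {
            pending.put(pool.submit(() -> process(line)));
        }
        pending.put(pool.submit(() -> (String) null)); // signal completion
        writer.join();
        pool.shutdown();
        return output;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(Arrays.asList("a", "b", "c")));
    }
}
```

In real use, the reader loop would be driven by a GSON `JsonReader` over the input file, and the writer would append to the result file instead of a list.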

Stephen C

I think you'll have a single process reading from the file which adds workers (Runnable/Callable) to a queue. You then have a pool of threads which consumes from the queue and executes the workers in parallel.

See the `Executors` static factory methods, which can help create an `ExecutorService`.
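A minimal sketch of that queue-and-pool idea, using `Executors.newFixedThreadPool`: a single loop plays the reader and submits `Callable`s; the pool runs them in parallel; the futures are collected in submission order. `processLine` is a placeholder for the real work.

```java
import java.util.*;
import java.util.concurrent.*;

public class WorkerPoolSketch {

    // Stand-in for the real processing of one line.
    static String processLine(String line) {
        return "processed:" + line;
    }

    public static List<String> runAll(List<String> lines) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);

        // The single reader submits one Callable per line; the pool's
        // internal queue holds the work until a thread is free.
        List<Future<String>> futures = new ArrayList<>();
        for (String line : lines) {
            futures.add(pool.submit(() -> processLine(line)));
        }

        // Collect results in submission order.
        List<String> results = new ArrayList<>();
        for (Future<String> f : futures) {
            results.add(f.get());
        }
        pool.shutdown();
        return results;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runAll(Arrays.asList("a", "b")));
    }
}
```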

lance-java