
I have a scenario where a user uploads a zip file. This zip file can contain up to 4999 JSON files, and each JSON file can have up to 4999 nodes, which I parse into objects and eventually insert into the database. When I tested this scenario, parsing took 30-50 minutes.

I am looking for suggestions where

  1. I want to read the JSON files in parallel: say, for a batch of 100 JSON files, I could have 50 threads running in parallel

  2. Each thread would be responsible for parsing its JSON file, which might become another perf bottleneck since there are 4999 nodes to parse. So I was thinking of reading another batch of 100 nodes at a time, spawning 50 child threads again

So in total there would be 2500 threads in the system, which should allow parallel execution of roughly 25,000,000 otherwise sequential operations.
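Roughly the kind of setup I have in mind for step 1, as a sketch (parseAndInsert is just a placeholder for my parsing and insert logic):

```java
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelParseSketch {

    public static void main(String[] args) throws InterruptedException {
        List<Path> jsonFiles = List.of(); // the ~4999 files extracted from the zip

        // 50 worker threads shared across all files, instead of one thread per file
        ExecutorService pool = Executors.newFixedThreadPool(50);
        for (Path file : jsonFiles) {
            pool.submit(() -> parseAndInsert(file));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    // Placeholder: parse one JSON file (~4999 nodes) and insert the resulting objects
    private static void parseAndInsert(Path file) {
        // parsing + DB insert logic goes here
    }
}
```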

Let me know whether this approach sounds fine or not.

Nabs
    _measure_, don't guess – Eugene Nov 24 '20 at 04:33
  • You said "I have some code; how can I make it faster?". You didn't show the code, you didn't show measurements of where the slow parts are. – chrylis -cautiouslyoptimistic- Nov 24 '20 at 04:46
  • Use a profiler to find out which aspect (method call stack) of your process consumes most of the time. Edit the results into your question. Tell us which library/framework you're using for JSON parsing and database insertion. Show us example code. Without this information, you can't hope for useful answers, only guesswork. – Ralf Kleberhoff Nov 24 '20 at 11:02
  • 1. Right: measure what the most time-consuming part of the process is (remember Amdahl's law). 2. If JSON parsing takes a significant part of the time, don't use object mapping at all; use event/token/stream-based parsing, adding a new insert to a JDBC batch with a PreparedStatement for each node on the fly. 3. Tune your DB to improve insert performance (optimize or even drop indexes before the run, shard the DB, use SSDs, etc.). 4. Find the optimal number of threads for the job; 2500 threads may slow the system down because of context switching, IO/memory-bus and other contentions... – AnatolyG Dec 06 '20 at 20:45

3 Answers


What you describe should not take that much time (30-50 minutes to parse); a JSON file with ~5k nodes is relatively small. The bottleneck will be in the database during the mass insert, especially if you have indexes on the fields.

So I suggest:

  1. Don't waste time on threading - unpacking and parsing the JSONs should be fast in your case. Focus on batch inserts instead and do them properly: queue 1000+ rows per batch and commit manually afterwards (see the sketch after this list).
  2. Disable indexes before importing, especially full-text indexes, and re-enable (+reindex) them afterwards.
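A minimal sketch of the batch pattern I mean, assuming plain JDBC (the connection URL, table and column names are placeholders):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BatchInsertSketch {

    static void insertNodes(List<String> nodes) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "pass")) {
            conn.setAutoCommit(false); // commit manually, once per batch
            String sql = "INSERT INTO nodes (payload) VALUES (?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                int count = 0;
                for (String node : nodes) {
                    ps.setString(1, node);
                    ps.addBatch();
                    if (++count % 1000 == 0) { // flush every 1000 rows
                        ps.executeBatch();
                        conn.commit();
                    }
                }
                ps.executeBatch(); // flush the remainder
                conn.commit();
            }
        }
    }
}
```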
Alex Chernyshev

I think the performance problem may come from:

  1. JSON parsing and object creation
  2. Inserting data into the DB: if you run many separate inserts, performance drops a lot

If you run 2500 threads, it may not be effective if you don't have many CPU cores, since the context-switching overhead will increase. Depending on your hardware configuration, you can pick an appropriate number of threads.

And to insert the data into the DB, I suggest doing it as below (a sketch follows the list):

  • In each thread, after parsing the JSON and creating the objects, write the objects out to a CSV file
  • After all threads finish, import the CSV into the DB
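A minimal sketch of the CSV step (MyObject and toCsvLine are placeholders for your parsed type; the bulk-load command depends on your DB):

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class CsvExportSketch {

    interface MyObject {
        String toCsvLine(); // placeholder: serialize one object as a CSV row
    }

    static void writeCsv(Path csvFile, List<MyObject> objects) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(csvFile)) {
            for (MyObject obj : objects) {
                writer.write(obj.toCsvLine());
                writer.newLine();
            }
        }
        // Then bulk-load the file, e.g.
        //   MySQL:      LOAD DATA LOCAL INFILE 'out.csv' INTO TABLE nodes FIELDS TERMINATED BY ',';
        //   PostgreSQL: COPY nodes FROM '/path/out.csv' WITH (FORMAT csv);
    }
}
```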
binhgreat

I would suggest using the DSM library. With DSM, you can easily parse very complex JSON files and process them during parsing, so you don't need to wait until all the JSON files have been processed; I guess that waiting is your main problem. BTW: it uses the Jackson streaming API to read the JSON, so it consumes very little memory.

Example usage can be found in this answer:

JAVA - Best approach to parse huge (extra large) JSON file
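For reference, a minimal sketch of the underlying Jackson streaming API on its own (plain Jackson, not DSM's mapping API):

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

import java.io.File;
import java.io.IOException;

public class StreamingParseSketch {

    // Walks the tokens one by one, so the whole file is never held in memory
    static void parse(File jsonFile) throws IOException {
        JsonFactory factory = new JsonFactory();
        try (JsonParser parser = factory.createParser(jsonFile)) {
            while (parser.nextToken() != null) {
                if (parser.currentToken() == JsonToken.FIELD_NAME) {
                    String field = parser.getCurrentName();
                    // handle each node/field as it streams past
                }
            }
        }
    }
}
```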

mfe