
I have a data processing pipeline consisting of 3 methods (say A(), B(), C(), applied sequentially) for an input text file, and I have to repeat this pipeline for 10000 different files. So far I have used ad hoc multithreading: I create one task per file and submit them all to a thread pool. Now I want to switch to Spark to achieve this parallelism.
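For context, my current ad hoc version looks roughly like the sketch below (the object name and the A/B/C bodies are placeholders for my real code; the sketch also uses a fixed-size pool rather than 10000 raw threads, which I understand is the usual practice):

```scala
import java.util.concurrent.{Executors, TimeUnit}

object AdHocPipeline {
  // Placeholder stand-ins for the real pipeline stages.
  def A(path: String): String = s"contents of $path"  // would actually read the file
  def B(text: String): String = text.toUpperCase      // some per-file transformation
  def C(text: String): Unit   = println(text.take(40)) // final stage / output

  def main(args: Array[String]): Unit = {
    val files = (1 to 10000).map(i => s"input-$i.txt")
    // One task per file, submitted to a pool sized to the machine,
    // instead of one OS thread per file.
    val pool = Executors.newFixedThreadPool(Runtime.getRuntime.availableProcessors())
    files.foreach { f =>
      pool.submit(new Runnable {
        override def run(): Unit = C(B(A(f)))
      })
    }
    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.HOURS)
  }
}
```

My questions are: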

  1. If Spark can do a better job, please guide me through the basic steps, since I'm new to Spark. (A sketch of what I have in mind follows this list.)
  2. If I keep the ad hoc multithreading and deploy it on a cluster, how can I manage resources so that the threads are spread evenly across the nodes? I'm new to HPC systems too.
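To make question 1 concrete, here is the kind of Spark job I imagine, as a minimal sketch. The paths, the app name, the partition count (numSlices = 200), and the A/B/C bodies are all made up; the point is only "one file = one record, one pipeline run = one map":

```scala
import org.apache.spark.sql.SparkSession

// extends Serializable so the closure over A/B/C ships to executors cleanly
object SparkFilePipeline extends Serializable {
  // Same placeholder stages as above (C returns a String here so it can be saved).
  def A(path: String): String = s"contents of $path"
  def B(text: String): String = text.toUpperCase
  def C(text: String): String = text.take(40)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("file-pipeline").getOrCreate()
    val sc = spark.sparkContext

    // Assumed: all 10000 input paths are known up front.
    val paths = (1 to 10000).map(i => s"hdfs:///data/input-$i.txt")

    // Each path becomes a record; Spark partitions the records and
    // schedules the resulting tasks across the executors on the cluster.
    val results = sc.parallelize(paths, numSlices = 200)
      .map(p => C(B(A(p))))

    results.saveAsTextFile("hdfs:///data/output")
    spark.stop()
  }
}
```

If the files already live in HDFS or S3, I gather sc.wholeTextFiles(dir) could replace the parallelize step, since it yields (path, content) pairs directly.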

I hope I'm asking the right questions. Thanks!
