I have a data processing pipeline consisting of 3 methods (let's say A(), B(), C(), run sequentially) for an input text file, and I have to repeat this pipeline for 10000 different files. So far I have used ad-hoc multithreading: I create one task per file and submit them all to a thread pool. Now I want to switch to Spark to do this in parallel. My questions are:
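For context, my current approach looks roughly like the sketch below (simplified; the file names and the a()/b()/c() functions are just placeholders standing in for my real A(), B(), C()):

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholders standing in for my real A(), B(), C()
def a(text): return text
def b(data): return data
def c(data): return data

def process_file(path):
    # Run the three steps sequentially on one file
    with open(path) as f:
        text = f.read()
    return c(b(a(text)))

# 10000 input files (names are made up)
files = [f"input_{i}.txt" for i in range(10000)]

# Submit one task per file to a thread pool
with ThreadPoolExecutor(max_workers=32) as pool:
    results = list(pool.map(process_file, files))
```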
- If Spark can do a better job here, please guide me through the basic steps, since I'm new to Spark. (My rough guess at what the Spark version might look like is sketched right after this list.)
- If I stick with ad-hoc multithreading and deploy it on a cluster, how can I manage resources so that the threads are allocated evenly across the nodes? I'm new to HPC systems too.
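To clarify the first question, here is what I imagine the Spark version might look like after skimming the docs (PySpark; the input path is made up and a()/b()/c() again stand in for my real methods). Please tell me if this is the wrong way to think about it:

```python
from pyspark.sql import SparkSession

# Placeholders standing in for my real A(), B(), C()
def a(text): return text
def b(data): return data
def c(data): return data

spark = SparkSession.builder.appName("file-pipeline").getOrCreate()
sc = spark.sparkContext

# wholeTextFiles yields an RDD of (path, file_content) pairs, one per file
rdd = sc.wholeTextFiles("hdfs:///data/input/*.txt")  # made-up input path

# Run the three steps on each file's content; Spark distributes the work across executors
results = rdd.mapValues(lambda text: c(b(a(text)))).collect()

spark.stop()
```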
I hope I'm asking the right questions, thanks!