I have a data processing pipeline consisting of 3 methods (let's say A(), B(), C(), run sequentially) for an input text file, and I have to repeat this pipeline for 10000 different files. So far I have used ad-hoc multithreading: I create one task per file and submit them all to a thread pool. Now I want to switch to Spark to do this in parallel. My questions are:
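For context, my current approach looks roughly like the sketch below (simplified; the file names and the a()/b()/c() functions are just placeholders standing in for my real A(), B(), C()):

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholders standing in for my real A(), B(), C()
def a(text): return text
def b(data): return data
def c(data): return data

def process_file(path):
    # Run the three steps sequentially on one file
    with open(path) as f:
        text = f.read()
    return c(b(a(text)))

# 10000 input files (names are made up)
files = [f"input_{i}.txt" for i in range(10000)]

# Submit one task per file to a thread pool
with ThreadPoolExecutor(max_workers=32) as pool:
    results = list(pool.map(process_file, files))
```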
- If Spark can do a better job here, please guide me through the basic steps, since I'm new to Spark. (My rough guess at what the Spark version might look like is sketched right after this list.)
- If I stick with ad-hoc multithreading and deploy it on a cluster, how can I manage resources so that the threads are allocated evenly across the nodes? I'm new to HPC systems too.
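To clarify the first question, here is what I imagine the Spark version might look like after skimming the docs (PySpark; the input path is made up and a()/b()/c() again stand in for my real methods). Please tell me if this is the wrong way to think about it:

```python
from pyspark.sql import SparkSession

# Placeholders standing in for my real A(), B(), C()
def a(text): return text
def b(data): return data
def c(data): return data

spark = SparkSession.builder.appName("file-pipeline").getOrCreate()
sc = spark.sparkContext

# wholeTextFiles yields an RDD of (path, file_content) pairs, one per file
rdd = sc.wholeTextFiles("hdfs:///data/input/*.txt")  # made-up input path

# Run the three steps on each file's content; Spark distributes the work across executors
results = rdd.mapValues(lambda text: c(b(a(text)))).collect()

spark.stop()
```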
I hope I'm asking the right questions, thanks!