I am writing a framework that performs data sanity checks. I have a set of inputs like:
```
{
  "check_1":   [sql_query_1, sql_query_2],
  "check_2":   [sql_query_1, sql_query_2],
  "check_3":   [sql_query_1, sql_query_2],
  ...
  "check_100": [sql_query_1, sql_query_2]
}
```
As you can see, there are 100 checks, and each check consists of at most two SQL queries. The idea is to fetch the data returned by the SQL queries and diff the result sets as a data-quality check.
Currently, I am running check_1, then check_2, and so on, which is very slow. I tried the joblib library to parallelize the task but got erroneous results. I have also read that multithreading in PySpark can be problematic.
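For reference, the sequential flow looks roughly like this. This is a minimal sketch: `run_query` is a stand-in for something like `spark.sql(query).collect()`, the query strings are the placeholders from above, and the diff step is elided.

```python
def run_query(sql):
    # Placeholder: in the real framework this would execute the
    # query on Spark, e.g. spark.sql(sql).collect().
    return sql


def run_check(queries):
    # Run the (at most 2) queries of one check, one after the other.
    results = [run_query(q) for q in queries]
    # The real code would diff the two result sets here; the sketch
    # just returns them.
    return results


# Mirrors the input structure shown above: 100 checks, 2 queries each.
checks = {f"check_{i}": ["sql_query_1", "sql_query_2"] for i in range(1, 101)}

# Current approach: one check at a time, one query at a time.
report = {name: run_check(queries) for name, queries in checks.items()}
```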
How can I achieve parallelism here? My idea is to:
- run as many checks in parallel as possible
- also run the SQL queries within a single check in parallel, if possible (I tried this with joblib but got erroneous results, more here)
NOTE: The FAIR scheduler is enabled in Spark.
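To make the intended shape concrete, here is a sketch of the two-level parallelism I am after, using `concurrent.futures` thread pools with the queries stubbed out. In real PySpark, `run_query` would call `spark.sql(...)`, and each worker thread could call `spark.sparkContext.setLocalProperty("spark.scheduler.pool", ...)` so the FAIR scheduler can interleave the concurrent jobs; whether this is safe and correct with my Spark setup is exactly the question.

```python
from concurrent.futures import ThreadPoolExecutor


def run_query(sql):
    # Placeholder for spark.sql(sql).collect(). A real worker thread
    # could also set spark.sparkContext.setLocalProperty(
    #     "spark.scheduler.pool", <pool_name>) for FAIR scheduling.
    return sql


def run_check(name, queries):
    # Inner parallelism: run the (at most 2) queries of one check
    # concurrently, then hand both result sets to the diff step.
    with ThreadPoolExecutor(max_workers=2) as inner:
        results = list(inner.map(run_query, queries))
    return name, results


# Same structure as the input above: 100 checks, 2 queries each.
checks = {f"check_{i}": ["sql_query_1", "sql_query_2"] for i in range(1, 101)}

# Outer parallelism: several checks at once. max_workers is capped so
# the driver does not submit all 200 queries to the cluster at once.
with ThreadPoolExecutor(max_workers=8) as outer:
    report = dict(outer.map(lambda kv: run_check(*kv), checks.items()))
```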