
I am creating two DataFrames from a JDBC data source and then writing them both to an S3 bucket. The timestamps of the files written to S3 are 20 seconds apart, which tells me the operations are not executed in parallel. For testing purposes, both DataFrames load the same number of rows from the same database/table. How can I make both the reads and the writes execute in parallel?

The Python script runs on an AWS Glue development endpoint with 2 DPUs of the standard worker type.

df1 = spark.read.format("jdbc") \
    .option("driver", driver) \
    .option("url", url) \
    .option("user", user) \
    .option("password", password) \
    .option("dbtable", query1) \
    .option("fetchSize", 50000) \
    .load()

df2 = spark.read.format("jdbc") \
    .option("driver", driver) \
    .option("url", url) \
    .option("user", user) \
    .option("password", password) \
    .option("dbtable", query2) \
    .option("fetchSize", 50000) \
    .load()

df1.write.mode("append").format("csv").option("compression", "gzip").option("timestampFormat", "yyyy.MM.dd HH:mm:ss,SSS").option("maxRecordsPerFile", 1000000).save("s3://bucket-name/test1")
df2.write.mode("append").format("csv").option("compression", "gzip").option("timestampFormat", "yyyy.MM.dd HH:mm:ss,SSS").option("maxRecordsPerFile", 1000000).save("s3://bucket-name/test2") 
jimmone
    Possible duplicate of [How to run independent transformations in parallel using PySpark?](https://stackoverflow.com/questions/38048068/how-to-run-independent-transformations-in-parallel-using-pyspark) – user10938362 May 20 '20 at 15:38
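The approach in the linked question is to trigger the two independent writes from separate driver threads so Spark can schedule both jobs at the same time. A minimal sketch of that idea, reusing the df1/df2 and paths from the question (the pool size is an assumption, not something from the original post):

from multiprocessing.pool import ThreadPool

def write_df(df_and_path):
    df, path = df_and_path
    # Each call triggers its own Spark action; because the actions are
    # issued from separate threads, the scheduler can run both jobs at once.
    df.write.mode("append").format("csv") \
        .option("compression", "gzip") \
        .option("timestampFormat", "yyyy.MM.dd HH:mm:ss,SSS") \
        .option("maxRecordsPerFile", 1000000) \
        .save(path)

pool = ThreadPool(2)
pool.map(write_df, [(df1, "s3://bucket-name/test1"),
                    (df2, "s3://bucket-name/test2")])
pool.close()
pool.join()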

1 Answer


Enable concurrent execution for the Glue job and run it twice, once per table. Within a single job the two DataFrames cannot be saved in parallel: Spark distributes the work of each write across the cluster, but the driver triggers the two writes one after the other.
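A rough sketch of how the script could be parameterized so one Glue job definition can be started twice concurrently, once per table. The --dbtable and --output_path job parameters are illustrative, "Maximum concurrent runs" must be raised above 1 in the job settings, and driver, url, user and password are assumed to be defined as in the question:

import sys
from awsglue.utils import getResolvedOptions

# Hypothetical job parameters passed when each run is started.
args = getResolvedOptions(sys.argv, ["JOB_NAME", "dbtable", "output_path"])

df = spark.read.format("jdbc") \
    .option("driver", driver) \
    .option("url", url) \
    .option("user", user) \
    .option("password", password) \
    .option("dbtable", args["dbtable"]) \
    .option("fetchSize", 50000) \
    .load()

df.write.mode("append").format("csv") \
    .option("compression", "gzip") \
    .option("timestampFormat", "yyyy.MM.dd HH:mm:ss,SSS") \
    .option("maxRecordsPerFile", 1000000) \
    .save(args["output_path"])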

Shubham Jain