
I am creating two DataFrames from a JDBC data source and then writing them both to an S3 bucket. The timestamps of the files written to S3 are 20 seconds apart, which tells me the operations are not executed in parallel. For testing purposes, both DataFrames load the same number of rows from the same database/table. How can I make both the reads and the writes execute in parallel?

The Python script runs on an AWS Glue development endpoint with 2 DPUs of the standard worker type.

df1 = spark.read.format("jdbc") \
    .option("driver", driver) \
    .option("url", url) \
    .option("user", user) \
    .option("password", password) \
    .option("dbtable", query1) \
    .option("fetchSize", 50000) \
    .load()

df2 = spark.read.format("jdbc") \
    .option("driver", driver) \
    .option("url", url) \
    .option("user", user) \
    .option("password", password) \
    .option("dbtable", query2) \
    .option("fetchSize", 50000) \
    .load()

df1.write.mode("append").format("csv").option("compression", "gzip").option("timestampFormat", "yyyy.MM.dd HH:mm:ss,SSS").option("maxRecordsPerFile", 1000000).save("s3://bucket-name/test1")
df2.write.mode("append").format("csv").option("compression", "gzip").option("timestampFormat", "yyyy.MM.dd HH:mm:ss,SSS").option("maxRecordsPerFile", 1000000).save("s3://bucket-name/test2") 
jimmone
    Possible duplicate of [How to run independent transformations in parallel using PySpark?](https://stackoverflow.com/questions/38048068/how-to-run-independent-transformations-in-parallel-using-pyspark) – user10938362 May 20 '20 at 15:38
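The approach in the linked question is to trigger the two independent writes from separate driver threads so Spark can schedule both jobs at the same time. A minimal sketch of that idea, reusing the df1/df2 and paths from the question (the pool size is an assumption, not something from the original post):

from multiprocessing.pool import ThreadPool

def write_df(df_and_path):
    df, path = df_and_path
    # Each call triggers its own Spark action; because the actions are
    # issued from separate threads, the scheduler can run both jobs at once.
    df.write.mode("append").format("csv") \
        .option("compression", "gzip") \
        .option("timestampFormat", "yyyy.MM.dd HH:mm:ss,SSS") \
        .option("maxRecordsPerFile", 1000000) \
        .save(path)

pool = ThreadPool(2)
pool.map(write_df, [(df1, "s3://bucket-name/test1"),
                    (df2, "s3://bucket-name/test2")])
pool.close()
pool.join()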

1 Answer


Enable concurrent execution for the Glue job and run it twice, once per table. Within a single job the two DataFrames cannot be saved in parallel: Spark distributes the work of each write across the cluster, but the driver triggers the two writes one after the other.
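A rough sketch of how the script could be parameterized so one Glue job definition can be started twice concurrently, once per table. The --dbtable and --output_path job parameters are illustrative, "Maximum concurrent runs" must be raised above 1 in the job settings, and driver, url, user and password are assumed to be defined as in the question:

import sys
from awsglue.utils import getResolvedOptions

# Hypothetical job parameters passed when each run is started.
args = getResolvedOptions(sys.argv, ["JOB_NAME", "dbtable", "output_path"])

df = spark.read.format("jdbc") \
    .option("driver", driver) \
    .option("url", url) \
    .option("user", user) \
    .option("password", password) \
    .option("dbtable", args["dbtable"]) \
    .option("fetchSize", 50000) \
    .load()

df.write.mode("append").format("csv") \
    .option("compression", "gzip") \
    .option("timestampFormat", "yyyy.MM.dd HH:mm:ss,SSS") \
    .option("maxRecordsPerFile", 1000000) \
    .save(args["output_path"])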

Shubham Jain