I am creating two DataFrames from a JDBC data source and then writing them both to an S3 bucket. The timestamps of the files written to S3 are 20 seconds apart, which tells me the operations are not executed in parallel. The data is loaded from the same database/table, with the same number of rows, for testing purposes. How can I make both the reads and the writes execute in parallel?
The Python script runs on an AWS Glue development endpoint with 2 DPUs and the standard worker type.
df1 = spark.read.format("jdbc").option("driver", driver).option("url", url).option("user", user).option("password", password).option("dbtable", query1).option("fetchSize", 50000).load()
df2 = spark.read.format("jdbc").option("driver", driver).option("url", url).option("user", user).option("password", password).option("dbtable", query2).option("fetchSize", 50000).load()
df1.write.mode("append").format("csv").option("compression", "gzip").option("timestampFormat", "yyyy.MM.dd HH:mm:ss,SSS").option("maxRecordsPerFile", 1000000).save("s3://bucket-name/test1")
df2.write.mode("append").format("csv").option("compression", "gzip").option("timestampFormat", "yyyy.MM.dd HH:mm:ss,SSS").option("maxRecordsPerFile", 1000000).save("s3://bucket-name/test2")