
I have two databases, A and B, in my RDS cluster (Postgres-based). I have two Glue connections set up, one for A and another for B.

I have a helper function that gets details such as host, URL, port, username, and password from the respective connections.

I am trying to read data from A using Spark, store it in a df, do minimal transformations, and perform a df.count(). The dataset has about 400 million records. The Glue job runs for about 36 minutes and then fails with a stage failure: "Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues."
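For context, the read is along these lines. This is only a sketch, not my exact code: get_connection_details stands in for my helper, and the table and partition column names are placeholders. It shows the kind of partitioned JDBC read (partitionColumn/numPartitions) I'm using so that the whole table does not come through a single connection:

    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    sc = SparkContext.getOrCreate()
    glue_context = GlueContext(sc)
    spark = glue_context.spark_session

    # Placeholder helper: returns host, port, dbname, user, password from the Glue connection
    conn = get_connection_details("connection-A")
    jdbc_url = f"jdbc:postgresql://{conn['host']}:{conn['port']}/{conn['dbname']}"

    df = (
        spark.read.format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", "public.my_table")       # placeholder table
        .option("user", conn["user"])
        .option("password", conn["password"])
        # Without the partitioning options below, Spark reads the whole
        # table through a single JDBC connection into one partition.
        .option("partitionColumn", "id")            # placeholder numeric column
        .option("lowerBound", "1")
        .option("upperBound", "400000000")
        .option("numPartitions", "40")
        .load()
    )

    print(df.count())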

The same thing happens when I write the data to database B. I tried increasing the number of executors and upgrading the Glue worker type to G.2X; it still throws the error, after about 1 hour 10 minutes. The Glue job timeout is 360 minutes.
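The write to B follows the same pattern (again a sketch with placeholder names); repartitioning before the write and setting a JDBC batchsize is how I'm trying to keep individual tasks small:

    conn_b = get_connection_details("connection-B")   # placeholder helper, as above
    jdbc_url_b = f"jdbc:postgresql://{conn_b['host']}:{conn_b['port']}/{conn_b['dbname']}"

    (
        df.repartition(100)                            # more, smaller tasks per executor
        .write.format("jdbc")
        .option("url", jdbc_url_b)
        .option("dbtable", "public.my_table_copy")     # placeholder target table
        .option("user", conn_b["user"])
        .option("password", conn_b["password"])
        .option("batchsize", "10000")                  # rows per JDBC batch insert
        .mode("append")
        .save()
    )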

How do I make this work? Would caching the df to disk solve this issue? Please help.

chexxmex
  • Divide your data into 100 (maybe 50) parts and then write each part to database B in a loop. – Linus Aug 04 '22 at 03:50
  • Do you mean repartition? Can you give me an example, please? – chexxmex Aug 04 '22 at 04:16
  • Please share your PySpark code so that we can provide more suggestions; looking at the problem, we could implement data batching. – devesh Aug 04 '22 at 08:18
  • Have you tried reading in parallel https://docs.aws.amazon.com/glue/latest/dg/run-jdbc-parallel-read-job.html or filtering data before reading https://stackoverflow.com/a/54375010/4326922 ? – Prabhakar Reddy Aug 05 '22 at 03:20
  • Thank you. I was able to figure out this issue. However, after loading the data into a df, it takes a lot of time doing the count: approx. 2 hrs 40 mins for 200 million records. – chexxmex Aug 06 '22 at 20:58
  • Hi, I have a similar issue. Can you please suggest anything? https://stackoverflow.com/questions/75199778/aws-glue-executorlostfailure-executor-15-exited-caused-by-one-of-the-running-ta – Vijeth Kashyap Jan 22 '23 at 11:17
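For reference, the parallel-read approach linked in the comments above would look roughly like the sketch below if the source table is in the Glue Data Catalog; the catalog database, table name, and hash field are placeholders, and glue_context is the same as in the read sketch earlier:

    # Split the JDBC read across executors via hashfield/hashpartitions
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database="my_catalog_db",     # placeholder catalog database
        table_name="my_table",        # placeholder catalog table
        additional_options={
            "hashfield": "id",        # column Glue hashes to split the read
            "hashpartitions": "10",   # number of parallel JDBC readers
        },
    )
    df = dyf.toDF()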

0 Answers