I have two databases, A and B, in my Postgres-based RDS cluster, and two Glue connections set up, one for A and another for B.
I have a helper function that pulls details such as host, URL, port, username, and password from the respective connection.
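The helper looks roughly like this (a minimal sketch; the function name, and fetching the properties through boto3's get_connection, are illustrative):

```python
import boto3

def get_jdbc_conf(connection_name):
    """Sketch: fetch the JDBC URL and credentials stored on a Glue connection."""
    glue = boto3.client("glue")
    props = glue.get_connection(Name=connection_name)["Connection"]["ConnectionProperties"]
    return {
        "url": props["JDBC_CONNECTION_URL"],  # e.g. jdbc:postgresql://host:5432/A
        "user": props["USERNAME"],
        "password": props["PASSWORD"],
    }
```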
I am trying to read data from A with Spark, store it in a DataFrame, do minimal transformations, and run df.count(). The dataset has about 400 million records. The Glue job runs for about 36 minutes and then fails with a stage failure: "Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues."
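The read is essentially this (sketch; the table name is illustrative, and `spark` is the SparkSession provided by the Glue job):

```python
conf_a = get_jdbc_conf("glue-connection-A")  # helper from above

df = (spark.read.format("jdbc")
      .option("url", conf_a["url"])
      .option("dbtable", "public.big_table")     # illustrative table name
      .option("user", conf_a["user"])
      .option("password", conf_a["password"])
      .option("driver", "org.postgresql.Driver")
      .load())

# minimal transformations, then:
print(df.count())
```

Note that this sketch sets none of the JDBC partitioning options (partitionColumn, lowerBound, upperBound, numPartitions), in which case Spark pulls the whole table through a single connection into a single partition.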
The same thing happens when I write the data to database B. I tried increasing the number of executors and upgrading the Glue worker type to G.2X; it still fails, after about 1 hour 10 minutes. The Glue job timeout is 360 minutes, so it is not hitting the timeout.
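The write is roughly (sketch; the target table name is illustrative):

```python
conf_b = get_jdbc_conf("glue-connection-B")

(df.write.format("jdbc")
   .option("url", conf_b["url"])
   .option("dbtable", "public.big_table_copy")   # illustrative table name
   .option("user", conf_b["user"])
   .option("password", conf_b["password"])
   .option("driver", "org.postgresql.Driver")
   .mode("append")
   .save())
```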
How do I make this work? Would caching the df to disk solve this issue? Please help.
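By caching to disk I mean something like:

```python
from pyspark import StorageLevel

df.persist(StorageLevel.DISK_ONLY)  # keep partitions on executor disk, not memory
df.count()                          # action that materializes the cache
```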