I am trying to read a table from a PostgreSQL database and load the data into a Hive table on HDFS, as below:
val yearDF = spark.read.format("jdbc").option("url", connectionUrl).option("dbtable", s"(${execQuery}) as year2017")
.option("user", devUserName).option("password", devPassword)
.option("partitionColumn","source_system_name")
.option("lowerBound", 10000).option("upperBound", 50000)
.option("numPartitions",12).load()
My job is failing with the message "Container exited with a non-zero exit code 143" and a GC overhead limit exceeded error. The table is huge, and partitioning on this column skews the data heavily across partitions.
After observing the data in the table, I understood that 'source_system_name' alone is not the right column to partition the data on. However, I found that a combination of three columns (source_system_name, org_code and period) is better suited for splitting the data while reading and processing it. (I ran a group by on these columns to count the rows per group, and the data is spread fairly evenly across the groups; see the sketch below.)
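This is roughly how I checked the distribution, sketched against the DataFrame from the read above (the actual check was a plain SQL group by on the source table):

import org.apache.spark.sql.functions.desc

// count rows per (source_system_name, org_code, period) combination
yearDF.groupBy("source_system_name", "org_code", "period")
  .count()
  .orderBy(desc("count"))
  .show(50, truncate = false)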
But I don't know how I can use all three columns in spark-jdbc's partitionColumn option. Could anyone let me know if it is possible to use multiple columns in the partitionColumn option while reading data from an RDBMS table?
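For context, the closest alternative I could find is the spark.read.jdbc(url, table, predicates, connectionProperties) overload, where each WHERE clause in the predicates array becomes one partition. A rough sketch of what I mean is below (the predicate values are made-up placeholders, not my real data), but I don't know whether this is the intended way to partition on multiple columns or whether there is a real multi-column equivalent of partitionColumn:

import java.util.Properties

val connProps = new Properties()
connProps.put("user", devUserName)
connProps.put("password", devPassword)

// one WHERE clause per partition; the values here are placeholders
val predicates = Array(
  "source_system_name = 'SYS_A' AND org_code = 'ORG1' AND period = 201701",
  "source_system_name = 'SYS_A' AND org_code = 'ORG2' AND period = 201701",
  "source_system_name = 'SYS_B' AND org_code = 'ORG1' AND period = 201702"
)

val yearDF = spark.read.jdbc(connectionUrl, s"(${execQuery}) as year2017", predicates, connProps)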