
I am trying to read a table from a Postgres database and load the data into a Hive table on HDFS, as below:

val yearDF = spark.read.format("jdbc")
  .option("url", connectionUrl)
  .option("dbtable", s"(${execQuery}) as year2017")
  .option("user", devUserName)
  .option("password", devPassword)
  .option("partitionColumn", "source_system_name")
  .option("lowerBound", 10000)
  .option("upperBound", 50000)
  .option("numPartitions", 12)
  .load()

My job is failing with the message "Container exited with a non-zero exit code 143" and a GC overhead issue.

The data is huge and is causing skew in the partitions. After observing the data in the table, I understood that 'source_system_name' is not the right column to partition the data on. But I have found that a combination of three other columns (source_system_name, org_code, period) is best suited to split the data while reading and processing. (I did a 'group by' to find the count of rows in the table, and the data is spread fairly evenly across each group.) But I don't know how I can use all three columns in spark-jdbc's partitionColumn option. Could anyone let me know if it is possible to use multiple columns in the partitionColumn option while reading data from an RDBMS table?
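Roughly, the 'group by' check mentioned above looks like this on the loaded DataFrame (a sketch; the row limit and ordering are arbitrary):

import org.apache.spark.sql.functions.desc

// Count rows per (source_system_name, org_code, period) on the DataFrame read above
// and look at the largest groups to gauge the skew.
yearDF.groupBy("source_system_name", "org_code", "period")
  .count()
  .orderBy(desc("count"))
  .show(50, truncate = false)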

Metadata
  • Is this a one-off load or something you will need to repeat many times? – Terry Dactyl Oct 05 '18 at 12:23
  • I have to run it ad hoc, whenever required, but I see this will be used more if the implementation is successful. – Metadata Oct 05 '18 at 12:25
  • Can you alter your source table to provide a unique incrementing integer column? If so, job done. – Terry Dactyl Oct 05 '18 at 12:34
  • There is a column, forecast_id, which is a unique integer column. But there are millions of rows in the table, so how can I define forecast_id as the partitionColumn and decide the lowerBound and upperBound using it? – Metadata Oct 05 '18 at 12:38
  • I don't think uniqueness is enough to eliminate the possibility of skew. You need to make sure the relevant values are fairly evenly distributed. If you use an increasing integer field, you know the first value is zero, so a simple select statement is all you need to determine the upper bound (roughly as sketched below). – Terry Dactyl Oct 05 '18 at 12:41
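A minimal sketch of that bounds select, issued through the same JDBC source before the partitioned read (forecast_id comes from the comments above; the alias and variable names are made up for illustration):

// Sketch: fetch min/max of forecast_id once, then feed them to lowerBound/upperBound.
val bounds = spark.read.format("jdbc")
  .option("url", connectionUrl)
  .option("user", devUserName)
  .option("password", devPassword)
  .option("dbtable", s"(SELECT min(forecast_id) AS lo, max(forecast_id) AS hi FROM (${execQuery}) q) AS bounds")
  .load()
  .first()

val lowerBound = bounds.getAs[Number]("lo").longValue()
val upperBound = bounds.getAs[Number]("hi").longValue()
// These values can then be passed to .option("lowerBound", lowerBound) and .option("upperBound", upperBound).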

1 Answer


No, the following applies:

  • For partitionColumn, lowerBound, upperBound - these options must all be specified if any of them is specified. In addition, numPartitions must be specified.

  • They describe how to partition the table when reading in parallel from multiple workers. partitionColumn must be a numeric column from the table in question, which implies it can only be a single column (see the sketch after this list).

  • lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in the table, so all rows in the table will be partitioned and returned.
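Given that single-column restriction, one possible workaround is to derive a single numeric bucket inside the pushed-down query and partition on it. A sketch only: hashtext is PostgreSQL's built-in text hashing function, any deterministic numeric expression over the three columns would do, and the bucket name and count are illustrative.

// Sketch: compute one numeric bucket from the three skew-relevant columns in the
// pushed-down query and use it as the (single, numeric) partitionColumn.
// The columns may need explicit casts to text before concatenation.
val buckets = 12
val bucketedQuery =
  s"""(SELECT q.*,
         abs(hashtext(source_system_name || org_code || period)) % $buckets AS part_bucket
       FROM (${execQuery}) q) AS year2017"""

val yearDF = spark.read.format("jdbc")
  .option("url", connectionUrl)
  .option("user", devUserName)
  .option("password", devPassword)
  .option("dbtable", bucketedQuery)
  .option("partitionColumn", "part_bucket")
  .option("lowerBound", 0)
  .option("upperBound", buckets)
  .option("numPartitions", buckets)
  .load()

With lowerBound 0 and upperBound equal to the bucket count, each bucket value maps to its own JDBC partition.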

thebluephantom
  • Ok. I read the data by partitioning it on an integer column of my data, but I'm still unable to partition the huge data properly while reading it, and hence I am facing a data skew problem. One of the reasons I had to post the question here is the issue below. Would you mind taking a look at it? https://stackoverflow.com/questions/52603131/how-to-fix-data-shuffling-issue-to-avoid-exception-java-lang-outofmemoryerror – Metadata Oct 05 '18 at 20:49
  • I was reading the data by partitioning on an integer column, cast_id, which is also unique. – Metadata Oct 05 '18 at 20:51
  • There are other questions with the same error, so it gets marked as a duplicate if I post it. But in my case the issue is skew, not memory. It is awkward to talk about another question here; I don't think it is a duplicate if one reads it, but if you don't think you have answered it earlier, could you just take a look at it? – Metadata Oct 05 '18 at 21:15
  • Analyze your data, which I think you have done, and create a second job so that you can partition both jobs accordingly. – thebluephantom Oct 05 '18 at 21:23
  • One last thing before I close this discussion: how can I implement "create a second job so that you can partition both jobs accordingly"? – Metadata Oct 06 '18 at 07:19
  • Just 2 jobs that write to 2 different stores, and then a third that reads from them. Or 2 reads with a union in the same job (roughly as sketched below). "Nood breekt wet" (necessity knows no law), as we say in Dutch. – thebluephantom Oct 06 '18 at 07:23
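For illustration, the "2 reads with a union in the same job" idea could be sketched like this; the split value 'BIG_SRC', the partition counts, and the bounds are assumptions, not something stated in the discussion:

import org.apache.spark.sql.DataFrame

// Sketch: read the heavily skewed key separately from the rest, with different
// partition counts, then union the two DataFrames in the same job.
def jdbcRead(query: String, parts: Int): DataFrame =
  spark.read.format("jdbc")
    .option("url", connectionUrl)
    .option("user", devUserName)
    .option("password", devPassword)
    .option("dbtable", s"($query) AS t")
    .option("partitionColumn", "forecast_id")
    .option("lowerBound", 0L)          // illustrative; ideally taken from a min/max bounds query
    .option("upperBound", 50000000L)   // illustrative
    .option("numPartitions", parts)
    .load()

// Give the heavy key more partitions than the rest, then combine.
val heavyDF = jdbcRead(s"SELECT * FROM (${execQuery}) q WHERE source_system_name = 'BIG_SRC'", 24)
val restDF  = jdbcRead(s"SELECT * FROM (${execQuery}) q WHERE source_system_name <> 'BIG_SRC'", 8)

val combinedDF = heavyDF.union(restDF)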