
When using Spark SQL to read JDBC data, Spark creates only one partition by default, so reading a large table is very slow.
I know there are two ways to create partitions (a minimal sketch of both follows):
1. set partitionColumn, lowerBound, upperBound, and numPartitions in the options;
2. pass an array of predicates (offset ranges), one per partition;
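
For reference, here is roughly what I mean by these two ways (Scala; the URL, table name, column names, and bounds are all placeholders):

```scala
import java.util.Properties

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-partitions").getOrCreate()
val url = "jdbc:postgresql://host:5432/mydb" // placeholder URL

// Way 1: numeric range partitioning via options.
// Spark splits [lowerBound, upperBound] into numPartitions ranges
// and issues one query per range on partitionColumn.
val byRange = spark.read
  .format("jdbc")
  .option("url", url)
  .option("dbtable", "my_table")
  .option("partitionColumn", "id") // must be a numeric column
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "10")
  .load()

// Way 2: explicit predicates, one per partition.
// Each array element becomes the WHERE clause of one partition's query.
val predicates = Array(
  "created_at <  '2017-03-06'",
  "created_at >= '2017-03-06' AND created_at < '2017-03-07'",
  "created_at >= '2017-03-07'"
)
val byPredicate = spark.read.jdbc(url, "my_table", predicates, new Properties())
```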
But my situation is:
my JDBC table has no INT column, and no string column that can easily be split into offsets, for either of these two ways.
Since these two ways won't work in my situation, is there any other way to make Spark read JDBC data in partitions?

AI Joes
  • There must be something you can partition on. The whole MapReduce paradigm and parallel processing rely on data partitioning to perform parallel operations. So would you care to give more information about your data so we can try to help? As is, your question is unsalvageable and subject to being closed. – eliasah Mar 05 '18 at 09:39
  • @eliasah I added an image link with a snapshot of the JDBC table. I have 10+ tables in the DB, and the columns are not the same... – AI Joes Mar 05 '18 at 09:45
  • I imagine that you have a finite number of package names for that table, for example. There is your partition. – eliasah Mar 05 '18 at 09:46
  • @eliasah Thanks! But can you give me an example? Should a partition be a range of offsets? How do I partition on a specific string? – AI Joes Mar 05 '18 at 10:07
  • I'm sorry, I can't give an example with the given information. You'll need to study your data distribution regardless. – eliasah Mar 05 '18 at 10:12
  • @eliasah I mean we usually partition on a range, like `2017-3-6 -> 2017-3-7`, but how can I use two specific values (e.g. `pname1`, `pname2`) to partition? – AI Joes Mar 07 '18 at 09:00
  • As @eliasah already mentioned, you should have a unique key for your table; otherwise you can't take advantage of Spark's features. Spark needs that column to create hash keys for the partitions. If you really don't have one, you can use the fetchsize property. Finally, it would help to post your table schema. – abiratsis Mar 07 '18 at 10:35
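
Following up on the predicates idea from the comments: each element of the predicates array becomes one partition's WHERE clause, so discrete string values work just as well as ranges. A minimal sketch (`package_name`, `pname1`, and `pname2` are hypothetical names taken from the thread):

```scala
import java.util.Properties

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-string-predicates").getOrCreate()
val url = "jdbc:postgresql://host:5432/mydb" // placeholder URL

// One partition per specific string value; no numeric column needed.
val predicates = Array(
  "package_name = 'pname1'", // hypothetical column and values
  "package_name = 'pname2'"
)
val byName = spark.read.jdbc(url, "my_table", predicates, new Properties())
```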

1 Answer


Take a look at this question: the solution is to use a pseudocolumn from the database and partition on the number of rows you want to read.

Spark JDBC pseudocolumn isn't working
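
A minimal sketch of that idea, assuming an Oracle source where ROWNUM is available as a pseudocolumn (the URL, names, and bounds are illustrative, and per the linked question this approach may need tuning):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-pseudocolumn").getOrCreate()

// Expose ROWNUM as a real column through a subquery, then let Spark
// range-partition on it like any numeric column.
val byRowNum = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//host:1521/service") // placeholder
  .option("dbtable", "(SELECT t.*, ROWNUM AS row_num FROM my_table t) sub")
  .option("partitionColumn", "row_num")
  .option("lowerBound", "1")
  .option("upperBound", "1000000") // approx. total row count of my_table
  .option("numPartitions", "10")
  .load()
```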

Devang