
When using Spark SQL to read JDBC data, Spark creates only one partition by default, so reading a large table is very slow.
I know there are two ways to create partitions (a minimal sketch of both follows):
1. set partitionColumn, lowerBound, upperBound, and numPartitions in the options;
2. pass an array of predicates (offset ranges), one per partition;
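
For reference, here is roughly what I mean by these two ways (Scala; the URL, table name, column names, and bounds are all placeholders):

```scala
import java.util.Properties

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-partitions").getOrCreate()
val url = "jdbc:postgresql://host:5432/mydb" // placeholder URL

// Way 1: numeric range partitioning via options.
// Spark splits [lowerBound, upperBound] into numPartitions ranges
// and issues one query per range on partitionColumn.
val byRange = spark.read
  .format("jdbc")
  .option("url", url)
  .option("dbtable", "my_table")
  .option("partitionColumn", "id") // must be a numeric column
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "10")
  .load()

// Way 2: explicit predicates, one per partition.
// Each array element becomes the WHERE clause of one partition's query.
val predicates = Array(
  "created_at <  '2017-03-06'",
  "created_at >= '2017-03-06' AND created_at < '2017-03-07'",
  "created_at >= '2017-03-07'"
)
val byPredicate = spark.read.jdbc(url, "my_table", predicates, new Properties())
```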
But my situation is:
my JDBC table has no INT column, and no string column that can easily be split into offsets, for either of these two ways.
Since these two ways won't work in my situation, is there any other way to make Spark read JDBC data in partitions?

AI Joes
  • There must be something you can partition on. The whole MapReduce paradigm and parallel processing rely on data partitioning to perform parallel operations. So would you care to give more information about your data so we can try to help? As is, your question is unsalvageable and subject to being closed. – eliasah Mar 05 '18 at 09:39
  • @eliasah I added an image link with a snapshot of the JDBC table. I have 10+ tables in the DB, and the columns are not the same... – AI Joes Mar 05 '18 at 09:45
  • I imagine that you have a finite number of package names for that table, for example. There is your partition. – eliasah Mar 05 '18 at 09:46
  • @eliasah Thanks! But can you give me an example? Should a partition be a range of offsets? How do I partition on a specific string? – AI Joes Mar 05 '18 at 10:07
  • I'm sorry, I can't give an example with the given information. You'll need to study your data distribution regardless. – eliasah Mar 05 '18 at 10:12
  • @eliasah I mean we usually partition on a range, like `2017-3-6 -> 2017-3-7`, but how can I use two specific values (e.g. `pname1`, `pname2`) to partition? – AI Joes Mar 07 '18 at 09:00
  • As @eliasah already mentioned, you should have a unique key for your table; otherwise you can't take advantage of Spark's features. Spark needs that column to create hash keys for the partitions. If you really don't have one, you can use the fetchsize property. Finally, it would help to post your table schema. – abiratsis Mar 07 '18 at 10:35
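
Following up on the predicates idea from the comments: each element of the predicates array becomes one partition's WHERE clause, so discrete string values work just as well as ranges. A minimal sketch (`package_name`, `pname1`, and `pname2` are hypothetical names taken from the thread):

```scala
import java.util.Properties

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-string-predicates").getOrCreate()
val url = "jdbc:postgresql://host:5432/mydb" // placeholder URL

// One partition per specific string value; no numeric column needed.
val predicates = Array(
  "package_name = 'pname1'", // hypothetical column and values
  "package_name = 'pname2'"
)
val byName = spark.read.jdbc(url, "my_table", predicates, new Properties())
```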

1 Answer


Take a look at this question: the solution is to use a pseudocolumn from the database and partition on the number of rows you want to read.

Spark JDBC pseudocolumn isn't working
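
A minimal sketch of that idea, assuming an Oracle source where ROWNUM is available as a pseudocolumn (the URL, names, and bounds are illustrative, and per the linked question this approach may need tuning):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-pseudocolumn").getOrCreate()

// Expose ROWNUM as a real column through a subquery, then let Spark
// range-partition on it like any numeric column.
val byRowNum = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//host:1521/service") // placeholder
  .option("dbtable", "(SELECT t.*, ROWNUM AS row_num FROM my_table t) sub")
  .option("partitionColumn", "row_num")
  .option("lowerBound", "1")
  .option("upperBound", "1000000") // approx. total row count of my_table
  .option("numPartitions", "10")
  .load()
```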

Devang