
I am querying a large database table using the spark.read.jdbc method and getting the following error:

com.mysql.cj.jdbc.exceptions.PacketTooBigException: Packet for query is too large (15,913,800 > 4,194,304)

which indicates the retrieved data is too large.
I don't have the option to alter the database settings, and I need to be able to retrieve all of the data, so I would like to read the data in chunks and have the result be a DataFrame. How can I achieve this?

For example, in Python I can query a database with pandas and read it in chunks (see the pandas `read_sql` docs), roughly as in the sketch below.
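A minimal pandas sketch of that chunked read, for comparison; the connection string, table name, and `process` helper are placeholders, not part of the original question:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical MySQL connection string -- replace with your own.
engine = create_engine("mysql+pymysql://user:password@host:3306/mydb")

# Passing chunksize makes read_sql return an iterator of DataFrames
# instead of loading the whole result set at once.
for chunk in pd.read_sql("SELECT * FROM my_table", engine, chunksize=10_000):
    process(chunk)  # `process` is a placeholder for your per-chunk logic
```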

alex

1 Answer


If you look at the documentation, you will find the `fetchsize` option that you can pass to `spark.read.jdbc`...
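A PySpark sketch of that suggestion, assuming a MySQL source; the URL, table name, and credentials are placeholders, and adding `useCursorFetch=true` to the URL is an assumption about MySQL Connector/J behaviour (without it the driver may ignore `fetchsize`):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read.format("jdbc")
    # Placeholder connection details; useCursorFetch=true is an assumption
    # about MySQL Connector/J -- verify for your driver version.
    .option("url", "jdbc:mysql://host:3306/mydb?useCursorFetch=true")
    .option("dbtable", "my_table")
    .option("user", "user")
    .option("password", "password")
    # Number of rows fetched per round trip, keeping each packet small.
    .option("fetchsize", "1000")
    .load()
)
```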

Alex Ott
  • Thought the general way was number of partitions, etc. – thebluephantom Feb 22 '21 at 18:36
  • @thebluephantom I think that would be for specifying parallelism, not the number of rows. Alex Ott's answer looks correct – alex Feb 22 '21 at 19:03
  • True, but in an abstract way it also achieves the same I think if I read the question. Anyway. – thebluephantom Feb 22 '21 at 19:04
  • If the table was consistently the same size, yes, it would if I used a large enough parallelism. But if the table is growing in size, then at some point I'd reach the same error. So using `fetchsize` for my use case is needed, as my target table will continue to grow – alex Feb 22 '21 at 19:13
  • @thebluephantom by default, there is no parallelization on read, until you specify `partitionColumn`, `lowerBound`, `upperBound`. See this answer: https://stackoverflow.com/a/43150938/18627 – Alex Ott Feb 22 '21 at 19:17
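For completeness, a sketch of the partitioned-read approach discussed in the comments, assuming a numeric column (here called `id`); the bounds, partition count, and connection details are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://host:3306/mydb")  # placeholder connection details
    .option("dbtable", "my_table")
    .option("user", "user")
    .option("password", "password")
    # Split the read into 8 range queries on a numeric/date/timestamp column.
    # lowerBound/upperBound only control how the ranges are sliced; rows outside
    # them are still read, just by the first/last partition.
    .option("partitionColumn", "id")
    .option("lowerBound", "1")
    .option("upperBound", "1000000")
    .option("numPartitions", "8")
    .load()
)
```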