
I am querying a large database table using the spark.read.jdbc method and getting the following error:

com.mysql.cj.jdbc.exceptions.PacketTooBigException: Packet for query is too large (15,913,800 > 4,194,304)

which indicates the retrieved data is too large.
I don't have the option to alter the database settings, and I need to be able to retrieve all of the data, so I would like to read the data in chunks and have the result be a DataFrame. How can I achieve this?

For example, in Python I can query a database with pandas and read it in chunks (see the pandas `read_sql` docs), roughly as in the sketch below.
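A minimal pandas sketch of that chunked read, for comparison; the connection string, table name, and `process` helper are placeholders, not part of the original question:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical MySQL connection string -- replace with your own.
engine = create_engine("mysql+pymysql://user:password@host:3306/mydb")

# Passing chunksize makes read_sql return an iterator of DataFrames
# instead of loading the whole result set at once.
for chunk in pd.read_sql("SELECT * FROM my_table", engine, chunksize=10_000):
    process(chunk)  # `process` is a placeholder for your per-chunk logic
```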

alex

1 Answer


If you look at the documentation, you will find the `fetchsize` option that you can pass to `spark.read.jdbc`...
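A PySpark sketch of that suggestion, assuming a MySQL source; the URL, table name, and credentials are placeholders, and adding `useCursorFetch=true` to the URL is an assumption about MySQL Connector/J behaviour (without it the driver may ignore `fetchsize`):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read.format("jdbc")
    # Placeholder connection details; useCursorFetch=true is an assumption
    # about MySQL Connector/J -- verify for your driver version.
    .option("url", "jdbc:mysql://host:3306/mydb?useCursorFetch=true")
    .option("dbtable", "my_table")
    .option("user", "user")
    .option("password", "password")
    # Number of rows fetched per round trip, keeping each packet small.
    .option("fetchsize", "1000")
    .load()
)
```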

Alex Ott
  • Thought the general way was number of partitions, etc. – thebluephantom Feb 22 '21 at 18:36
  • @thebluephantom I think that would be for specifying parallelism, not the number of rows. Alex Ott's answer looks correct – alex Feb 22 '21 at 19:03
  • True, but in an abstract way it also achieves the same I think if I read the question. Anyway. – thebluephantom Feb 22 '21 at 19:04
  • If the table was consistently the same size, yes, it would if I used a large enough parallelism. But if the table is growing in size, then at some point I'd reach the same error. So using `fetchsize` for my use case is needed, as my target table will continue to grow – alex Feb 22 '21 at 19:13
  • @thebluephantom by default, there is no parallelization on read, until you specify `partitionColumn`, `lowerBound`, `upperBound`. See this answer: https://stackoverflow.com/a/43150938/18627 – Alex Ott Feb 22 '21 at 19:17
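For completeness, a sketch of the partitioned-read approach discussed in the comments, assuming a numeric column (here called `id`); the bounds, partition count, and connection details are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://host:3306/mydb")  # placeholder connection details
    .option("dbtable", "my_table")
    .option("user", "user")
    .option("password", "password")
    # Split the read into 8 range queries on a numeric/date/timestamp column.
    # lowerBound/upperBound only control how the ranges are sliced; rows outside
    # them are still read, just by the first/last partition.
    .option("partitionColumn", "id")
    .option("lowerBound", "1")
    .option("upperBound", "1000000")
    .option("numPartitions", "8")
    .load()
)
```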