
I have a set of parquet files which I want to load into a database that is an unsupported sink in Spark. The object used to communicate with the database cannot be used within a Spark function such as foreach, because it fails to serialize. I also cannot use collect(), as the data does not fit in the driver's memory.

Is there functionality in Spark to iterate through the RDD in the memory of the driver instance?

Blaze

1 Answer


You don't want to batch process; you want to process by partition:

foreachPartition will allow you to create separate closures that write the data into your [unsupported database]. You will need to initialize the database connection inside the foreachPartition code block so that the connection is created on each executor.

(Code inside the foreachPartition block runs on the executors, not the driver. This gets around initializing a connection on your driver and then shipping that live connection to all of your executors, which is the serialization error you keep getting.)

There are lots of good resources out there that explain closures and what they mean. But if you just want to solve this problem, initialize the database connection inside the foreachPartition code block, as in the sketch below.
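
A minimal sketch of the pattern in Scala, assuming a hypothetical `DatabaseClient` object standing in for your unsupported sink's client library (the parquet path and endpoint are placeholders):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical stand-in for the client of your unsupported database;
// replace this with the real client library's connect/insert/close calls.
object DatabaseClient {
  trait Connection {
    def insert(values: Seq[Any]): Unit
    def close(): Unit
  }
  def connect(endpoint: String): Connection = ??? // open a real connection here
}

object ParquetToDatabase {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parquet-to-db").getOrCreate()

    // Read the parquet files as a DataFrame (the path is a placeholder).
    val df = spark.read.parquet("/path/to/parquet")

    // foreachPartition runs this closure on the executors, once per partition,
    // so the connection is created on the executor and is never serialized
    // from the driver.
    df.rdd.foreachPartition { rows =>
      val conn = DatabaseClient.connect("db-host:1234")
      try {
        rows.foreach(row => conn.insert(row.toSeq))
      } finally {
        conn.close()
      }
    }

    spark.stop()
  }
}
```

Because the connection is opened once per partition rather than once per row, you can also buffer rows inside the same block and issue batched writes if your client supports them.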

Matt Andruff