
I have a huge database table containing millions of records. Each record can be processed in isolation, and it has to be converted into, let's say, a string.

So I started looking around and I was wondering if Spark could help me in this scenario. Specifically, I wrote something very simple like this:

session.read.jdbc(...).rdd                             // read the table over JDBC, then drop to the RDD API
    .map(row => ...convert each row into a string...)  // convert each Row to a String
    .saveAsTextFile(....)                              // write one line per record

Problem: it works perfectly with small/medium tables, but with huge tables I get an OutOfMemoryError.

Even though I think I understand how the JDBC partitioning works (and it does work), it seems that session.read.jdbc only starts handing rows to the map step after the whole dataset has been loaded.
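
For reference, this is roughly how I set up the partitioned read (the connection string, table name, partition column and bounds below are just placeholders, not my real schema):

// Roughly how I configured the partitioned JDBC read (all values are placeholders)
val df = session.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/mydb")  // placeholder connection string
  .option("dbtable", "my_big_table")                  // placeholder table name
  .option("user", "user")
  .option("password", "password")
  .option("partitionColumn", "id")                    // numeric column the read is split on
  .option("lowerBound", "1")
  .option("upperBound", "100000000")
  .option("numPartitions", "100")                     // 100 parallel JDBC queries
  .option("fetchsize", "10000")                       // rows fetched per round trip by the JDBC driver
  .load()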

Is it possible, using this approach or another, to convert (i.e. process) each row as it is read?

I already had a look at the similar question pointed out in the comments below, but there the user is doing an aggregation (df.count), while I just need to iterate over the records one by one, so I was wondering whether this kind of "lazy" iteration is possible, something like the sketch below.
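
Conceptually, this is what I would like to end up with, kept entirely in the Dataset API so that Spark can stream each partition through the conversion and straight to disk (the conversion and the output path are placeholders). I don't know whether this actually avoids the OutOfMemoryError, which is basically my question:

import session.implicits._             // provides the String encoder used by map

// df is the DataFrame obtained from the JDBC read above
df.map(row => row.mkString(","))       // placeholder conversion: turn each Row into a String
  .write
  .text("/output/path")                // placeholder output path, one line per record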

Thx

Andrea
