
I have a huge database table containing millions of records. Each record can be processed in isolation, and it has to be converted into, let's say, a string.

So I started looking around and I was wondering if Spark could help me in this scenario. Specifically, I wrote something very simple like this:

session.read.jdbc(...).rdd                             // read the table over JDBC, then drop to the RDD API
    .map(row => ...convert each row into a string...)  // convert each Row to a String
    .saveAsTextFile(....)                              // write one line per record

Problem: it works perfectly with small/medium tables, but with huge tables I get an OutOfMemoryError.

Even though I think I understand how the JDBC partitioning works (and it does work), it seems that session.read.jdbc only starts handing rows to the map step after the whole dataset has been loaded.
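
For reference, this is roughly how I set up the partitioned read (the connection string, table name, partition column and bounds below are just placeholders, not my real schema):

// Roughly how I configured the partitioned JDBC read (all values are placeholders)
val df = session.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/mydb")  // placeholder connection string
  .option("dbtable", "my_big_table")                  // placeholder table name
  .option("user", "user")
  .option("password", "password")
  .option("partitionColumn", "id")                    // numeric column the read is split on
  .option("lowerBound", "1")
  .option("upperBound", "100000000")
  .option("numPartitions", "100")                     // 100 parallel JDBC queries
  .option("fetchsize", "10000")                       // rows fetched per round trip by the JDBC driver
  .load()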

Is it possible, using this approach or another, to convert (i.e. process) each row as it is read?

I already had a look at the similar question pointed out in the comments below, but there the user is doing an aggregation (df.count), while I just need to iterate over the records one by one, so I was wondering whether this kind of "lazy" iteration is possible, something like the sketch below.
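
Conceptually, this is what I would like to end up with, kept entirely in the Dataset API so that Spark can stream each partition through the conversion and straight to disk (the conversion and the output path are placeholders). I don't know whether this actually avoids the OutOfMemoryError, which is basically my question:

import session.implicits._             // provides the String encoder used by map

// df is the DataFrame obtained from the JDBC read above
df.map(row => row.mkString(","))       // placeholder conversion: turn each Row into a String
  .write
  .text("/output/path")                // placeholder output path, one line per record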

Thx

Andrea
