
I need to read a large dataset from a file, convert it into a Spark matrix, and run some machine learning algorithms on the matrix. I want to benchmark the speed of the machine learning algorithms, but because Spark RDDs are lazily evaluated, this is difficult: when I measure the runtime, it also includes the time spent parsing the input file.

Is there a way to force Spark to materialize some RDDs, so that the input file is parsed in advance, before the machine learning algorithm runs?

Thanks, Da

Da Zheng

2 Answers

4

I usually do something like this:

val persisted = rdd.persist(...)

Which storage level to use depends on the size of your RDD: if it fits in memory, use memory only; otherwise, use memory and disk.
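For example, the two options spelled out (a minimal sketch; rdd stands for the RDD parsed from your input file):

import org.apache.spark.storage.StorageLevel

// Fits in memory: keep the deserialized partitions in RAM only.
val persisted = rdd.persist(StorageLevel.MEMORY_ONLY)

// Too big for memory: spill the partitions that don't fit to disk.
// val persisted = rdd.persist(StorageLevel.MEMORY_AND_DISK)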

And then:

persisted.count()
// now you can use 'persisted', it's materialized

After that, run the rest of your pipeline transformations (the ML part, in your case).

count is an action, so it materializes the RDD, and because you persisted it beforehand, the subsequent stages will read the RDD from the persisted storage rather than re-parsing the file.
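Putting it together for your benchmarking case, a sketch could look like the following (the path, the space-separated parsing, and the Pearson correlation are placeholders standing in for your own parsing and ML code; sc is an existing SparkContext):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.storage.StorageLevel

// Parse the input file into vectors and persist the result.
val data = sc.textFile("hdfs:///path/to/dataset")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .persist(StorageLevel.MEMORY_ONLY)

// Force materialization so the parsing cost is paid up front.
data.count()

// Only the ML stage is inside the timed region now.
val start = System.nanoTime()
val corr = Statistics.corr(data, "pearson")
val seconds = (System.nanoTime() - start) / 1e9
println(s"$seconds seconds")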

Igor Berman
  • If you are persisting in memory only, you can use val persisted = rdd.cache(); it has exactly the same effect as persist with MEMORY_ONLY. – PinoSan Mar 19 '16 at 23:42
  • I guess this is what you mean: I make the RDD persistent with the data stored in memory and run count(). It does trigger Spark to parse the input file. However, when I run corr() on the persisted data, I don't see any speedup. Am I doing the right thing? lines = sc.textFile(sys.argv[1]); data = lines.map(parseVector); data1 = data.persist(storageLevel=StorageLevel.MEMORY_ONLY); data1.count(); start_time = time.time(); corr = Statistics.corr(data1, method=corrType); end_time = time.time(); print("%s seconds" % (end_time - start_time)) – Da Zheng Mar 20 '16 at 01:04
  • Have you validated that your file can fit in memory? It might be that the bottleneck is in the ML and not in the parsing. – Igor Berman Mar 20 '16 at 06:36
  • I guessed I needed to increase the JVM heap size (it is now much larger than the dataset). After I did that, I do see some performance improvement. – Da Zheng Mar 20 '16 at 17:08
0

You can also use shared variables (broadcast variables) and accumulators if your application (the ML part) has multiple tasks that reuse data1.count (see the comments under @Igor's answer); the Shared Variables section of the Spark programming guide covers both. A sketch follows.
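As a minimal sketch of that idea (variable names are made up for illustration, and data1 is the persisted RDD of vectors from the comments above), compute the count once, broadcast it, and have tasks read the broadcast value instead of calling data1.count again; an accumulator goes the other way, collecting a per-task tally back to the driver:

import org.apache.spark.mllib.linalg.Vectors

// Compute the expensive value once on the driver...
val rowCount = data1.count()

// ...and broadcast it so every task can read it without another pass over the data.
val rowCountBc = sc.broadcast(rowCount)

// Inside transformations, use rowCountBc.value instead of calling data1.count().
val scaled = data1.map(v => Vectors.dense(v.toArray.map(_ / rowCountBc.value)))

// An accumulator aggregates values from tasks back to the driver,
// e.g. counting records as a side effect of an action that runs anyway.
val processed = sc.accumulator(0L, "processed")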