
I need to read a large dataset from a file, convert it into a Spark matrix, and run some machine learning algorithms on the matrix. I want to benchmark the speed of the machine learning algorithms, but because Spark RDDs are lazily evaluated, this is difficult: when I measure the runtime, it also includes the time spent parsing the input file.

Is there a way to force Spark to materialize some RDDs, so that the input file is parsed in advance, before the machine learning algorithm runs?

Thanks, Da

Da Zheng

2 Answers

4

I usually do something like this:

val persisted = rdd.persist(...)

Which storage level to use depends on the size of your RDD: if it fits in memory, use memory only; otherwise, use memory and disk.
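For example, the two options spelled out (a minimal sketch; rdd stands for the RDD parsed from your input file):

import org.apache.spark.storage.StorageLevel

// Fits in memory: keep the deserialized partitions in RAM only.
val persisted = rdd.persist(StorageLevel.MEMORY_ONLY)

// Too big for memory: spill the partitions that don't fit to disk.
// val persisted = rdd.persist(StorageLevel.MEMORY_AND_DISK)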

And then:

persisted.count()
// now you can use 'persisted', it's materialized

After that, run the rest of your pipeline transformations (the ML part, in your case).

count is an action, so it materializes the RDD, and because you persisted it beforehand, the subsequent stages will read the RDD from the persisted storage rather than re-parsing the file.
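Putting it together for your benchmarking case, a sketch could look like the following (the path, the space-separated parsing, and the Pearson correlation are placeholders standing in for your own parsing and ML code; sc is an existing SparkContext):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.storage.StorageLevel

// Parse the input file into vectors and persist the result.
val data = sc.textFile("hdfs:///path/to/dataset")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .persist(StorageLevel.MEMORY_ONLY)

// Force materialization so the parsing cost is paid up front.
data.count()

// Only the ML stage is inside the timed region now.
val start = System.nanoTime()
val corr = Statistics.corr(data, "pearson")
val seconds = (System.nanoTime() - start) / 1e9
println(s"$seconds seconds")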

Igor Berman
  • If you are persisting in memory only, you can use val persisted = rdd.cache(); it has exactly the same effect as persist with MEMORY_ONLY. – PinoSan Mar 19 '16 at 23:42
  • I guess this is what you mean: I make the RDD persistent with the data stored in memory and run count(). It does trigger Spark to parse the input file. However, when I run corr() on the persisted data, I don't see any speedup. Am I doing the right thing? lines = sc.textFile(sys.argv[1]); data = lines.map(parseVector); data1 = data.persist(storageLevel=StorageLevel.MEMORY_ONLY); data1.count(); start_time = time.time(); corr = Statistics.corr(data1, method=corrType); end_time = time.time(); print("%s seconds" % (end_time - start_time)) – Da Zheng Mar 20 '16 at 01:04
  • Have you validated that your file can fit in memory? It might be that the bottleneck is in the ML and not in the parsing. – Igor Berman Mar 20 '16 at 06:36
  • I guessed I needed to increase the JVM heap size (it is now much larger than the dataset). After I did that, I do see some performance improvement. – Da Zheng Mar 20 '16 at 17:08
0

You can also use shared variables (broadcast variables) and accumulators if your application (the ML part) has multiple tasks that reuse data1.count (see the comments under @Igor's answer); the Shared Variables section of the Spark programming guide covers both. A sketch follows.
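As a minimal sketch of that idea (variable names are made up for illustration, and data1 is the persisted RDD of vectors from the comments above), compute the count once, broadcast it, and have tasks read the broadcast value instead of calling data1.count again; an accumulator goes the other way, collecting a per-task tally back to the driver:

import org.apache.spark.mllib.linalg.Vectors

// Compute the expensive value once on the driver...
val rowCount = data1.count()

// ...and broadcast it so every task can read it without another pass over the data.
val rowCountBc = sc.broadcast(rowCount)

// Inside transformations, use rowCountBc.value instead of calling data1.count().
val scaled = data1.map(v => Vectors.dense(v.toArray.map(_ / rowCountBc.value)))

// An accumulator aggregates values from tasks back to the driver,
// e.g. counting records as a side effect of an action that runs anyway.
val processed = sc.accumulator(0L, "processed")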