I need to read a large dataset from a file, convert it into a Spark matrix, and run some machine learning algorithms on that matrix. I want to benchmark the speed of those algorithms, but because Spark RDDs are lazily evaluated, this is difficult: the runtime I measure also includes the time spent parsing the input file.
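To illustrate the problem, here is a rough plain-Python analogy (this is not Spark code). A generator stands in for a lazily evaluated RDD, and `parse_lines`, `column_sums`, and the sample data are hypothetical stand-ins for my parser, the algorithm, and the input file.

```python
import time


def parse_lines(lines):
    # Lazy, like an RDD transformation: nothing is parsed until consumed.
    for line in lines:
        yield [float(x) for x in line.split(",")]


def column_sums(rows):
    # Hypothetical stand-in for the machine learning algorithm being timed.
    totals = None
    for row in rows:
        if totals is None:
            totals = list(row)
        else:
            totals = [t + v for t, v in zip(totals, row)]
    return totals


# Made-up sample input standing in for the lines of the input file.
raw = ["1.0,2.0,3.0", "4.0,5.0,6.0"] * 1000

# Case 1: the input is still lazy, so the timed region includes parsing.
lazy_rows = parse_lines(raw)
start = time.perf_counter()
result_lazy = column_sums(lazy_rows)
elapsed_with_parsing = time.perf_counter() - start

# Case 2: parse everything up front, then time only the algorithm.
materialized_rows = list(parse_lines(raw))
start = time.perf_counter()
result_materialized = column_sums(materialized_rows)
elapsed_algorithm_only = time.perf_counter() - start
```

The second case is the measurement I am after: both cases compute the same result, but only the second keeps the parsing cost out of the timed region.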
Is there a way to force Spark to materialize certain RDDs, so that the input file is parsed before I start timing the machine learning algorithm?
Thanks, Da