
I'm following an example of PCA analysis in Spark 3.0.0, using Scala 2.12.10. I'm quite new to programming in Scala and I'm having trouble understanding some of its nuances.

After defining the data as follows:

val data = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)

the DataFrame is created as follows:

val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

My question is: what does data.map(Tuple1.apply) do? I guess what bugs me is the fact that apply doesn't have arguments.

Thank you in advance! Perhaps someone can also recommend a good beginner Scala / Spark book so my questions can be better ones in the future?

MDSvensson

2 Answers


It makes a tuple of one element (a Tuple1) that createDataFrame and toDF can use as input to create a DataFrame with one column of type vector. That's all, but very handy.
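
For illustration, a minimal sketch of just that wrapping step (the two dense vectors are borrowed from the question):

    import org.apache.spark.ml.linalg.{Vector, Vectors}

    val data = Array(
      Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
      Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
    )

    // Tuple1.apply wraps each Vector in a one-element tuple,
    // giving an Array[Tuple1[Vector]] that createDataFrame can accept.
    val wrapped: Array[Tuple1[Vector]] = data.map(Tuple1.apply)
    // wrapped(0) prints as ([2.0,0.0,3.0,4.0,5.0])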

Some book references: https://mungingdata.com/apache-spark/best-books/. I found the Databricks courses too simple, omitting relevant aspects. Some good sites exist as well: https://sparkbyexamples.com/ and https://www.waitingforcode.com/; the latter offers a good course at little cost.

On Scala's apply there is also an excellent answer on SO: What is the apply function in Scala?
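
In short, apply is the factory method on the Tuple1 companion object, and Tuple1(x) is just sugar for Tuple1.apply(x). A quick plain-Scala sketch of why no argument list is needed inside map(Tuple1.apply):

    // Tuple1(42) is sugar for Tuple1.apply(42)
    val a = Tuple1(42)
    val b = Tuple1.apply(42)
    assert(a == b)

    // Passing Tuple1.apply to map eta-expands the method into a function
    // value, so map supplies the argument for each element:
    val wrapped = Array(1, 2, 3).map(Tuple1.apply)
    // wrapped contains Tuple1(1), Tuple1(2), Tuple1(3)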

thebluephantom
  • Thank you for your answer. Can you elaborate, though? Focusing on Tuple1.apply, can we go over that part? So .map maps every element of data, the Vectors, to one-element tuples? I've tried looking for examples of .apply but wasn't very successful. There is a lot to learn about Scala. Any good books I could use? I'm using "Scala Programming for Beginners" by Ray Yao, but some concepts appear to be missing there based on what I find on Stack Overflow. – MDSvensson Aug 27 '20 at 11:44
  • Well, it has to do with Product. – thebluephantom Aug 27 '20 at 12:00
  • Thanks :) Scala is surely not the easiest language but it is rewarding! – MDSvensson Aug 27 '20 at 18:03
  • There is pure Scala and Scala with Spark. – thebluephantom Aug 27 '20 at 19:34

There is a subtlety in this line of code that is tricky for people new to Scala. To answer your question: mapping Tuple1.apply over the array of vectors simply wraps each vector in a one-element tuple, producing an object of type Array[Tuple1[org.apache.spark.ml.linalg.Vector]] (implicitly viewable as a Seq). The reason for doing this is that the Seq overload of spark.createDataFrame requires its element type to be a Product, i.e. a tuple or a case class, so that Spark can derive a schema from the fields; a bare Vector is not a Product, but Tuple1 is the smallest one. The trailing .toDF("features") then just renames the single auto-generated column from _1 to features.

Relatedly, SparkSession's implicits (import spark.implicits._) bring a bunch of implicit conversions from standard Scala objects to Datasets into scope. Notice that Seq itself doesn't have a toDF method; the Scala compiler finds an implicit conversion in scope that wraps the Seq and supplies toDF under the hood, so you could equally write data.toSeq.map(Tuple1.apply).toDF("features") and skip the explicit createDataFrame call.
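
A minimal, self-contained sketch of both routes; the local[*] master and app name here are illustrative assumptions, not from the question:

    import org.apache.spark.ml.linalg.{Vector, Vectors}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("tuple1-demo").getOrCreate()
    import spark.implicits._ // brings the Seq -> Dataset conversions into scope

    val data = Array(
      Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
      Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0)
    )

    // Each Vector is wrapped in a Tuple1, which is a Product,
    // so Spark can derive a one-column schema from its single field.
    val wrapped: Array[Tuple1[Vector]] = data.map(Tuple1.apply)

    // Route 1 (as in the question): explicit createDataFrame,
    // then rename the auto-generated column "_1" to "features".
    val df1 = spark.createDataFrame(wrapped).toDF("features")

    // Route 2: the implicit conversion from spark.implicits._ supplies toDF directly.
    val df2 = data.toSeq.map(Tuple1.apply).toDF("features")

    df1.printSchema() // root |-- features: vector (nullable = true)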
