6

My questions is:

In line number 5: What is the operation of the symbol of single apostrophes(')? Cannot understand very clearly that how withColumn function is working over here. Also Please elaborate how it is displaying like these Column order- |id |text |upper |.

Code:

 1. val dataset = Seq((0, "hello"),(1, "world")).toDF("id","text")
 2. val upper: String => String =_.toUpperCase
 3. import org.apache.spark.sql.functions.udf
 4. val upperUDF = udf(upper)
 5. dataset.withColumn("upper", upperUDF('text)).show

Output:

+---------+---------+---------+
|id       |text     |upper    |
+---------+---------+---------+
| 0       | hello   |HELLO    |
| 1       | world   |WORLD    |
+---------+---------+---------+
  • 2
    `withColumn` creates a new column if the first argument passed to it doesn't already exist in a dataframe. If it does, it overwrites the existing column. `'text` is used to select the column named `text`. In place of `'text`, you can also use `$"text"` or `org.apache.spark.sql.functions.col(text)`. The order is like that because the first two columns are `id` and `text` and `withColumn` is _appending_ a column which will get added at the end. You need to go through the documentation first. You'll find lots of interesting things there. – philantrovert Aug 11 '17 at 13:51
  • @philantrovert Thanks a lot –  Aug 11 '17 at 14:34

1 Answers1

16

The ' symbol in Scala is syntax sugar for creating an instance of the Symbol. class. From the documentation

For instance, the Scala term 'mysym will invoke the constructor of the Symbol class in the following way: Symbol("mysym").

So when you write 'text, the compiler expands it into new Symbol("text").

There is additional magic here, since Sparks upperUDF method requires a Column type, not a Symbol. But, there exists an implicit in scope defined in SQLImplicits called symbolToColumn which converts a symbol to a column:

implicit def symbolToColumn(s: Symbol): ColumnName = new ColumnName(s.name)

If we extract away all the implicits and syntax sugar, the equivalent would be:

dataset.withColumn("upper", upperUDF(new Column(new Symbol("text").name))).show
Yuval Itzchakov
  • 146,575
  • 32
  • 257
  • 321