I'm new to Spark. I have the code below, which is meant to convert a given column to lowercase and update the given DataFrame. I found this logic online, but it is not working for me.

Data: test.csv

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,rock
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,rock
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,rock

I want to convert the hashID values in the first column to lowercase ("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"). For this I have the code below:

import com.holdenkarau.spark.testing.{RDDComparisons, SharedSparkContext}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lower}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.scalatest.{BeforeAndAfter, FunSuite}

class Test extends FunSuite with SharedSparkContext with RDDComparisons with BeforeAndAfter
  with Serializable {

  test(" test lowerCase") {

    val testSchema = StructType(
      Array(
        StructField("hashID", StringType, false),
        StructField("name", StringType, false)
      ))

    val builder = SparkSession.builder()
    builder.master("local[*]")

    // Build spark session
    val spark = builder
      .config("spark.driver.maxResultSize", "0")
      .appName("testData")
      .config("spark.driver.extraJavaOptions", "-Xss10M")
      .getOrCreate()

    var DF = spark.read.format("csv").option("header", "false").schema(testSchema).load("~/test.csv")

    println("before")
    val colName = "hashID"
    DF.select(colName).take(2).foreach(println)
    DF.withColumn(colName, lower(col(colName)))
    println("after")
    DF.select(colName).take(2).foreach(println)
  }
}
SCouto
Raj

1 Answer

This happens because you are not assigning the result of withColumn to anything. Since you keep reading from the same variable (DF), you always print the original values.

You just need to change one line:

DF = DF.withColumn(colName, lower(col(colName)))

The complete piece of code will be:

println("before")
val colName="hashID"
DF.select(colName).take(2).foreach(println)
DF = DF.withColumn(colName, lower(col(colName)))
println("after")
DF.select(colName).take(2).foreach(println)
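
The same point can be illustrated with a small self-contained sketch. It assumes a local SparkSession and a couple of made-up in-memory rows instead of test.csv; the object name LowerCaseSketch and the helper lowerColumn are hypothetical names chosen for illustration, not part of the original code.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lower}

object LowerCaseSketch {

  // Hypothetical helper: returns a NEW DataFrame with the given column lowercased.
  // The input DataFrame is left untouched.
  def lowerColumn(df: DataFrame, colName: String): DataFrame =
    df.withColumn(colName, lower(col(colName)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("lowerCaseSketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up rows standing in for test.csv (assumption for illustration).
    val df = Seq(("AAAA", "rock"), ("BBBB", "rock")).toDF("hashID", "name")

    // The result has to be captured; withColumn does not change df.
    val dfLowerCase = lowerColumn(df, "hashID")

    df.show(false)          // still shows the uppercase hashID values
    dfLowerCase.show(false) // shows the lowercase hashID values

    spark.stop()
  }
}
```

Capturing the result in a new val, rather than reassigning a var, keeps both the original and the transformed DataFrame available under distinct names.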
SCouto
  • Thanks, this worked. But DF is a var and not a val, so shouldn't it be modified in place? – Raj Apr 19 '20 at 17:39
  • As a matter of fact, the reassignment is only possible because it's a var (variable) and not a val (constant value). Generally speaking, it is recommended to use val instead of var, so you can do: val dfLowerCase = DF.withColumn(colName, lower(col(colName))) followed by dfLowerCase.show(false), and use dfLowerCase instead of DF from that line on. – SCouto Apr 19 '20 at 17:41
  • We are both thinking the same thing, but in the original code DF is a var and not a val. So DF.withColumn should update DF in place since DF is a var, right? – Raj Apr 19 '20 at 17:43
  • No, DataFrames are immutable in Spark; any transformation creates a new one. So withColumn returns a new DF and never modifies an existing one. No matter whether you use var or val, withColumn returns a completely new one. Using var just allows you to change the assigned value. – SCouto Apr 19 '20 at 17:44
  • You can find a good and detailed explanation here https://stackoverflow.com/questions/53374140/if-dataframes-in-spark-are-immutable-why-are-we-able-to-modify-it-with-operatio – SCouto Apr 19 '20 at 17:46
  • Thanks, much appreciated! – Raj Apr 19 '20 at 17:52
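
The var versus val distinction discussed in the comments can be seen without Spark at all. The sketch below is only an analogy: it uses a plain Scala String, which is also immutable, as a stand-in for a DataFrame. The object name VarVsImmutable and the method demo are hypothetical.

```scala
object VarVsImmutable {

  // Returns the value held by `s` after a discarded call and after a reassignment.
  def demo(input: String): (String, String) = {
    var s = input

    // toLowerCase returns a new String; the result is discarded here,
    // so `s` still refers to the original value.
    s.toLowerCase
    val afterDiscard = s

    // `var` only means the name can be pointed at a different value.
    s = s.toLowerCase
    val afterReassign = s

    (afterDiscard, afterReassign)
  }

  def main(args: Array[String]): Unit = {
    val (afterDiscard, afterReassign) = demo("AAAA")
    println(afterDiscard)  // prints AAAA
    println(afterReassign) // prints aaaa
  }
}
```

In both cases the object itself is never changed; declaring the name as a var only controls whether the name may later refer to a different object.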