
For example, I want to replace all values equal to 0.2 in a column with 0. How can I do that in Scala? Thanks.

Edit:

+----+-----+-----+--------------------+-----+
|year| make|model|             comment|blank|
+----+-----+-----+--------------------+-----+
|2012|Tesla|    S|          No comment|     |
|1997| Ford| E350|Go get one now th...|     |
|2015|Chevy| Volt|                null| null|
+----+-----+-----+--------------------+-----+

This is my DataFrame. I'm trying to change "Tesla" in the make column to "S".

Javier Montón
Tong

6 Answers


Spark 1.6.2, Java code (sorry): this will change every instance of Tesla to S across the entire DataFrame without passing through an RDD:

dataframe.withColumn("make", when(col("make").equalTo("Tesla"), "S")
                             .otherwise(col("make")));

Edited to add @marshall245's "otherwise" to ensure non-Tesla values aren't converted to NULL.
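
The original question asks about Scala and a numeric column. As a minimal Scala sketch of the same when/otherwise pattern applied to that case (replacing 0.2 with 0), assuming a hypothetical Double column named value:

import org.apache.spark.sql.functions.{col, when}

// Hypothetical column "value": rewrite 0.2 as 0.0, keep everything else via otherwise().
val cleaned = dataframe.withColumn(
  "value",
  when(col("value") === 0.2, 0.0).otherwise(col("value"))
)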

ChrisOdney
Azeroth2b
  • Hey man, what if I want to change a column with a value from another dataframe column (both dataframes have an id column)? I can't seem to make it work in Java Spark. – Vasile Surdu Mar 09 '17 at 16:05
  • This is probably better served with a select .. join on id; given that, it sounds like a new question. Hope that gets you started. – Azeroth2b Mar 11 '17 at 18:02
  • Why edit this one and make it the same answer as @marshall245's? – Eduardo Reis Nov 18 '18 at 05:45
  • Where can I find the doc for the withColumn function? I actually have more conditions and more columns to change the values of. I got this https://docs.azuredatabricks.net/spark/1.6/sparkr/functions/withColumn.html but it is not helping. Can anyone help? – GadaaDhaariGeek Dec 27 '18 at 06:22
  • That might be a new question. You can chain DataFrame commands. For example, you can do dataframe.withColumn(...).withColumn(...). Every call returns a new dataframe, a mutation of the original according to the command. – Azeroth2b Jan 27 '19 at 20:24

Building off of the solution from @Azeroth2b: if you want to replace only a couple of items and leave the rest unchanged, do the following. Without the otherwise(...) call, the remainder of the column becomes null.

import org.apache.spark.sql.functions._

val newsdf =
  sdf.withColumn(
    "make",
    when(col("make") === "Tesla", "S").otherwise(col("make"))
  )

Old DataFrame

+-----+-----+ 
| make|model| 
+-----+-----+ 
|Tesla|    S| 
| Ford| E350| 
|Chevy| Volt| 
+-----+-----+ 

New DataFrame

+-----+-----+
| make|model|
+-----+-----+
|    S|    S|
| Ford| E350|
|Chevy| Volt|
+-----+-----+
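
If more than one value needs remapping, when clauses can be chained before the final otherwise. A small sketch (the second mapping is invented for illustration):

import org.apache.spark.sql.functions.{col, when}

// Conditions are checked in order; otherwise() preserves rows that match nothing.
val remapped = sdf.withColumn(
  "make",
  when(col("make") === "Tesla", "S")
    .when(col("make") === "Chevy", "C") // hypothetical extra mapping
    .otherwise(col("make"))
)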
Xavier Guihot
marshall245

This can be achieved in DataFrames with a user-defined function (udf).

import org.apache.spark.sql.functions._
val sqlcont = new org.apache.spark.sql.SQLContext(sc)
val df1 = sqlcont.jsonRDD(sc.parallelize(Array(
      """{"year":2012, "make": "Tesla", "model": "S", "comment": "No Comment", "blank": ""}""",
      """{"year":1997, "make": "Ford", "model": "E350", "comment": "Get one", "blank": ""}""",
      """{"year":2015, "make": "Chevy", "model": "Volt", "comment": "", "blank": ""}"""
    )))

val makeSIfTesla = udf {(make: String) => 
  if(make == "Tesla") "S" else make
}
df1.withColumn("make", makeSIfTesla(df1("make"))).show
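
The same udf approach carries over to the numeric case from the original question. A sketch, assuming a hypothetical DataFrame df with a Double column named value:

import org.apache.spark.sql.functions.udf

// Hypothetical numeric variant: map 0.2 to 0.0, pass everything else through.
val zeroIfPointTwo = udf { (v: Double) => if (v == 0.2) 0.0 else v }
df.withColumn("value", zeroIfPointTwo(df("value"))).show()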
Al M
  • I guess this will improve the performance because you are not converting the df to an rdd and adding a new column. – Nandakishore Oct 31 '16 at 19:41
  • This does not result in duplicate `make` columns? – WestCoastProjects Jul 24 '18 at 00:01
  • @javadba no, withColumn will make a new column if it doesn't exist, or replace an existing column. https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.DataFrame – Tony Nov 21 '19 at 23:03

Note: As mentioned by Olivier Girardot, this answer is not optimized and the withColumn solution is the one to use (see Azeroth2b's answer).

I cannot delete this answer, as it has been accepted.


Here is my take on this one:

import org.apache.spark.sql.{Row, SQLContext}

val rdd = sc.parallelize(
  List((2012, "Tesla", "S"), (1997, "Ford", "E350"), (2015, "Chevy", "Volt"))
)
val sqlContext = new SQLContext(sc)

// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

val dataframe = rdd.toDF()

dataframe.foreach(println)

dataframe.map(row => {
  val row1 = row.getAs[String](1)
  val make = if (row1.toLowerCase == "tesla") "S" else row1
  Row(row(0), make, row(2))
}).collect().foreach(println)

// [2012,S,S]
// [1997,Ford,E350]
// [2015,Chevy,Volt]

You can actually use map directly on the DataFrame.

So you basically check column 1 for the string "tesla". If it is "tesla", use the value "S" for make; otherwise, keep the current value of column 1.

Then build a Row with all the data from the original row, using the zero-based indexes (Row(row(0), make, row(2)) in my example).

There is probably a better way to do it; I am not that familiar yet with the Spark umbrella.

ccheneson
  • Thanks for your help. I have one more question. Your solution can print out the strings I want. However, what if I want to change the value within the dataframe itself? When I do dataframe.show() the value is still tesla. – Tong Sep 02 '15 at 20:38
  • Dataframes are based on RDDs, which are immutable. Try `val newDF = dataframe.map(row => { val row1 = row.getAs[String](1); val make = if (row1.toLowerCase == "tesla") "S" else row1; Row(row(0), make, row(2)) })`; that should construct a new DataFrame. – ccheneson Sep 02 '15 at 21:04
  • Thanks! It works! Feels so good! I set a new data frame and added a new column. – Tong Sep 03 '15 at 01:07
  • Hi! First, thanks for solving my problem. Can I convert a DataFrame to an RDD just with .rdd? Is there any risk, like changing the schema? Thanks again! – Tong Sep 04 '15 at 14:21
  • I only know this way to convert to an rdd. I don't know about the schema. You should post another question so that other people will see it and answer. – ccheneson Sep 04 '15 at 15:09
  • Thanks! The .rdd works pretty well. Do you know how I can convert an rdd back to a DataFrame? I really appreciate that you answer so many questions for me. – Tong Sep 04 '15 at 15:42
  • Use `toDF()` to go from `RDD` to `DataFrame`. – ccheneson Sep 04 '15 at 15:45
  • I tried, but it said value toDF is not a member of org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]. – Tong Sep 04 '15 at 16:00
  • From the `sqlContext`, you can use the method `def createDataFrame(rowRDD: RDD[Row], schema: StructType): DataFrame`. – ccheneson Sep 05 '15 at 20:44
  • To generate a Dataframe: `import sqlContext.implicits._; dataframe.map(row => {val make = row.getAs[String]("make");(row.getAs[Int]("year"), if (make == "tesla") "S" else make,row.getAs[String]("model"), ...)}).toDF("year","make","model", ...)` – leo9r Sep 22 '16 at 05:06
  • This will break Spark's Catalyst optimisations, and is therefore not best practice; the withColumn approach is best suited for this. – Olivier Girardot Jan 05 '17 at 19:29

df2.na.replace("Name",Map("John" -> "Akshay","Cindy" -> "Jayita")).show()

The replace method in class DataFrameNaFunctions has the signature def replace[T](col: String, replacement: Map[T, T]): org.apache.spark.sql.DataFrame.

To run this function you need an active Spark session and a DataFrame with headers on.
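
This also maps neatly onto the numeric example from the original question. A minimal sketch, assuming a hypothetical DataFrame df with a Double column named value:

// Replaces every 0.2 in "value" with 0.0; all other values are left untouched.
df.na.replace("value", Map(0.2 -> 0.0)).show()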

  • This answer wasn't written for the OP's use case, but is the easiest way to accomplish the task. You do need `import org.apache.spark.sql.DataFrameNaFunctions` – Tony Nov 21 '19 at 23:49
import org.apache.spark.sql.functions._

// Read a headered CSV, keep rows where CPF is not null, and derive CARD_KEY
// by stripping "." and "-" from the cpf column with nested translate() calls.
val base_optin_email = spark.read.option("header", "true").option("delimiter", ",")
  .schema(schema_base_optin).csv(file_optin_email)
  .where("CPF IS NOT NULL")
  .withColumn("CARD_KEY", translate(translate(col("cpf"), ".", ""), "-", ""))
Juan Diego Lozano
  • Welcome to Stack Overflow. Code-only answers can generally be improved by adding some explanation of how and why they work. When adding an answer to a four-year-old question with an accepted answer, it is very important to point out what new aspect of the question your answer addresses. Please format code as such using the button in the editor for that purpose, with the curly braces on it. – Jason Aller Jun 25 '20 at 03:11