
For example, I want to replace all values equal to 0.2 in a column with 0. How can I do that in Scala? Thanks.

Edit:

+----+-----+-----+--------------------+-----+
|year| make|model|             comment|blank|
+----+-----+-----+--------------------+-----+
|2012|Tesla|    S|          No comment|     |
|1997| Ford| E350|Go get one now th...|     |
|2015|Chevy| Volt|                null| null|
+----+-----+-----+--------------------+-----+

This is my DataFrame. I'm trying to change "Tesla" in the make column to "S".

Javier Montón
Tong

6 Answers


Spark 1.6.2, Java code (sorry): this will change every instance of Tesla to S across the entire DataFrame without passing through an RDD:

dataframe.withColumn("make", when(col("make").equalTo("Tesla"), "S")
                             .otherwise(col("make")));

Edited to add @marshall245's "otherwise" to ensure non-Tesla values aren't converted to NULL.
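
The original question asks about Scala and a numeric column. As a minimal Scala sketch of the same when/otherwise pattern applied to that case (replacing 0.2 with 0), assuming a hypothetical Double column named value:

import org.apache.spark.sql.functions.{col, when}

// Hypothetical column "value": rewrite 0.2 as 0.0, keep everything else via otherwise().
val cleaned = dataframe.withColumn(
  "value",
  when(col("value") === 0.2, 0.0).otherwise(col("value"))
)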

ChrisOdney
Azeroth2b
  • Hey man, what if I want to change a column with a value from another dataframe column (both dataframes have an id column)? I can't seem to make it work in Java Spark. – Vasile Surdu Mar 09 '17 at 16:05
  • This is probably better served with a select .. join on id; given that, it sounds like a new question. Hope that gets you started. – Azeroth2b Mar 11 '17 at 18:02
  • Why edit this one and make it the same answer as @marshall245's? – Eduardo Reis Nov 18 '18 at 05:45
  • Where can I find the doc for the withColumn function? I actually have more conditions and more columns to change the values of. I got this https://docs.azuredatabricks.net/spark/1.6/sparkr/functions/withColumn.html but it is not helping. Can anyone help? – GadaaDhaariGeek Dec 27 '18 at 06:22
  • That might be a new question. You can chain DataFrame commands. For example, you can do dataframe.withColumn(...).withColumn(...). Every call returns a new dataframe, a mutation of the original according to the command. – Azeroth2b Jan 27 '19 at 20:24

Building off of the solution from @Azeroth2b: if you want to replace only a couple of items and leave the rest unchanged, do the following. Without the otherwise(...) call, the remainder of the column becomes null.

import org.apache.spark.sql.functions._

val newsdf =
  sdf.withColumn(
    "make",
    when(col("make") === "Tesla", "S").otherwise(col("make"))
  )

Old DataFrame

+-----+-----+ 
| make|model| 
+-----+-----+ 
|Tesla|    S| 
| Ford| E350| 
|Chevy| Volt| 
+-----+-----+ 

New DataFrame

+-----+-----+
| make|model|
+-----+-----+
|    S|    S|
| Ford| E350|
|Chevy| Volt|
+-----+-----+
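
If more than one value needs remapping, when clauses can be chained before the final otherwise. A small sketch (the second mapping is invented for illustration):

import org.apache.spark.sql.functions.{col, when}

// Conditions are checked in order; otherwise() preserves rows that match nothing.
val remapped = sdf.withColumn(
  "make",
  when(col("make") === "Tesla", "S")
    .when(col("make") === "Chevy", "C") // hypothetical extra mapping
    .otherwise(col("make"))
)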
Xavier Guihot
marshall245

This can be achieved in DataFrames with a user-defined function (udf).

import org.apache.spark.sql.functions._
val sqlcont = new org.apache.spark.sql.SQLContext(sc)
val df1 = sqlcont.jsonRDD(sc.parallelize(Array(
      """{"year":2012, "make": "Tesla", "model": "S", "comment": "No Comment", "blank": ""}""",
      """{"year":1997, "make": "Ford", "model": "E350", "comment": "Get one", "blank": ""}""",
      """{"year":2015, "make": "Chevy", "model": "Volt", "comment": "", "blank": ""}"""
    )))

val makeSIfTesla = udf {(make: String) => 
  if(make == "Tesla") "S" else make
}
df1.withColumn("make", makeSIfTesla(df1("make"))).show
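
The same udf approach carries over to the numeric case from the original question. A sketch, assuming a hypothetical DataFrame df with a Double column named value:

import org.apache.spark.sql.functions.udf

// Hypothetical numeric variant: map 0.2 to 0.0, pass everything else through.
val zeroIfPointTwo = udf { (v: Double) => if (v == 0.2) 0.0 else v }
df.withColumn("value", zeroIfPointTwo(df("value"))).show()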
Al M
  • I guess this will improve the performance because you are not converting the df to an rdd and adding a new column. – Nandakishore Oct 31 '16 at 19:41
  • This does not result in duplicate `make` columns? – WestCoastProjects Jul 24 '18 at 00:01
  • @javadba no, withColumn will make a new column if it doesn't exist, or replace an existing column. https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.DataFrame – Tony Nov 21 '19 at 23:03

Note: As mentioned by Olivier Girardot, this answer is not optimized and the withColumn solution is the one to use (see Azeroth2b's answer).

I cannot delete this answer, as it has been accepted.


Here is my take on this one:

import org.apache.spark.sql.{Row, SQLContext}

val rdd = sc.parallelize(
  List((2012, "Tesla", "S"), (1997, "Ford", "E350"), (2015, "Chevy", "Volt"))
)
val sqlContext = new SQLContext(sc)

// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

val dataframe = rdd.toDF()

dataframe.foreach(println)

dataframe.map(row => {
  val row1 = row.getAs[String](1)
  val make = if (row1.toLowerCase == "tesla") "S" else row1
  Row(row(0), make, row(2))
}).collect().foreach(println)

// [2012,S,S]
// [1997,Ford,E350]
// [2015,Chevy,Volt]

You can actually use map directly on the DataFrame.

So you basically check column 1 for the string "tesla". If it is "tesla", use the value "S" for make; otherwise, keep the current value of column 1.

Then build a Row with all the data from the original row, using the zero-based indexes (Row(row(0), make, row(2)) in my example).

There is probably a better way to do it; I am not that familiar yet with the Spark umbrella.

ccheneson
  • Thanks for your help. I have one more question. Your solution can print out the strings I want. However, what if I want to change the value within the dataframe itself? When I do dataframe.show() the value is still tesla. – Tong Sep 02 '15 at 20:38
  • Dataframes are based on RDDs, which are immutable. Try `val newDF = dataframe.map(row => { val row1 = row.getAs[String](1); val make = if (row1.toLowerCase == "tesla") "S" else row1; Row(row(0), make, row(2)) })`; that should construct a new DataFrame. – ccheneson Sep 02 '15 at 21:04
  • Thanks! It works! Feels so good! I set a new data frame and added a new column. – Tong Sep 03 '15 at 01:07
  • Hi! First, thanks for solving my problem. Can I convert a DataFrame to an RDD just with .rdd? Is there any risk, like changing the schema? Thanks again! – Tong Sep 04 '15 at 14:21
  • I only know this way to convert to an rdd. I don't know about the schema. You should post another question so that other people will see it and answer. – ccheneson Sep 04 '15 at 15:09
  • Thanks! The .rdd works pretty well. Do you know how I can convert an rdd back to a DataFrame? I really appreciate that you answer so many questions for me. – Tong Sep 04 '15 at 15:42
  • Use `toDF()` to go from `RDD` to `DataFrame`. – ccheneson Sep 04 '15 at 15:45
  • I tried, but it said value toDF is not a member of org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]. – Tong Sep 04 '15 at 16:00
  • From the `sqlContext`, you can use the method `def createDataFrame(rowRDD: RDD[Row], schema: StructType): DataFrame`. – ccheneson Sep 05 '15 at 20:44
  • To generate a Dataframe: `import sqlContext.implicits._; dataframe.map(row => {val make = row.getAs[String]("make");(row.getAs[Int]("year"), if (make == "tesla") "S" else make,row.getAs[String]("model"), ...)}).toDF("year","make","model", ...)` – leo9r Sep 22 '16 at 05:06
  • This will break Spark's Catalyst optimisations, and is therefore not best practice; the withColumn approach is best suited for this. – Olivier Girardot Jan 05 '17 at 19:29

df2.na.replace("Name",Map("John" -> "Akshay","Cindy" -> "Jayita")).show()

The replace method in class DataFrameNaFunctions has the signature def replace[T](col: String, replacement: Map[T, T]): org.apache.spark.sql.DataFrame.

To run this function you need an active Spark session and a DataFrame with headers on.
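
This also maps neatly onto the numeric example from the original question. A minimal sketch, assuming a hypothetical DataFrame df with a Double column named value:

// Replaces every 0.2 in "value" with 0.0; all other values are left untouched.
df.na.replace("value", Map(0.2 -> 0.0)).show()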

  • This answer wasn't written for the OP's use case, but is the easiest way to accomplish the task. You do need `import org.apache.spark.sql.DataFrameNaFunctions` – Tony Nov 21 '19 at 23:49
import org.apache.spark.sql.functions._

// Read a headered CSV, keep rows where CPF is not null, and derive CARD_KEY
// by stripping "." and "-" from the cpf column with nested translate() calls.
val base_optin_email = spark.read.option("header", "true").option("delimiter", ",")
  .schema(schema_base_optin).csv(file_optin_email)
  .where("CPF IS NOT NULL")
  .withColumn("CARD_KEY", translate(translate(col("cpf"), ".", ""), "-", ""))
Juan Diego Lozano
  • Welcome to Stack Overflow. Code-only answers can generally be improved by adding some explanation of how and why they work. When adding an answer to a four-year-old question with an accepted answer, it is very important to point out what new aspect of the question your answer addresses. Please format code as such using the button in the editor for that purpose, with the curly braces on it. – Jason Aller Jun 25 '20 at 03:11