Spark SQL Map only one column of DataFrame

Question

Sorry for the noob question, I have a dataframe in SparkSQL like this:

id | name | data
----------------
1  | Mary | ABCD
2  | Joey | DOGE
3  | Lane | POOP
4  | Jack | MEGA
5  | Lynn | ARGH

I want to know how to do two things:

1) use a scala function on one or more columns to produce another column 2) use a scala function on one or more columns to replace a column

Examples:

1) Create a new boolean column that tells whether the data starts with A:

id | name | data | startsWithA
------------------------------
1  | Mary | ABCD |        true
2  | Joey | DOGE |       false
3  | Lane | POOP |       false
4  | Jack | MEGA |       false
5  | Lynn | ARGH |        true

2) Replace the data column with its lowercase counterpart:

id | name | data
----------------
1  | Mary | abcd
2  | Joey | doge
3  | Lane | poop
4  | Jack | mega
5  | Lynn | argh

What is the best way to do this in SparkSQL? I've seen many examples of how to return a single transformed column, but I don't know how to get back a new DataFrame with all the original columns as well.

You can use `withColumn` to add columns and keep the existing ones. For examples, see [this post](https://stackoverflow.com/questions/40959655/adding-two-columns-to-existing-dataframe-using-withcolumn). If you use the same name as an existing column, it will overwrite it. — pault, May 08 '18 at 16:32
In pyspark you could do this via: `from pyspark.sql.functions import col, lower; df = df.withColumn('startsWithA', col('data').startswith('A')).withColumn('data', lower(col('data')))`. Should be straightforward to translate into scala. — pault, May 08 '18 at 16:39
@pault Thanks, but do you know how to use this to replace a certain column? Say I wanted to transform my uppercase data into lowercase as in example 2. — user3685285, May 08 '18 at 16:39
Or you can use spark-sql, something like: `"SELECT id, name, LOWER(data) AS data, data LIKE 'A%' AS startsWithA FROM myTable"` — pault, May 08 '18 at 16:42

koiralo · Accepted Answer · 2018-05-08T16:53:41.953

You can use withColumn to add new column or to replace the existing column as

val df = Seq(
 (1, "Mary", "ABCD"),
 (2, "Joey", "DOGE"),
 (3, "Lane", "POOP"),
 (4, "Jack", "MEGA"),
 (5, "Lynn", "ARGH")
).toDF("id", "name", "data")


val resultDF = df.withColumn("startsWithA", $"data".startsWith("A"))
  .withColumn("data", lower($"data"))

If you want separate dataframe then

val resultDF1 = df.withColumn("startsWithA", $"data".startsWith("A"))
val resultDF2 = df.withColumn("data", lower($"data"))

withColumn replaces the old column if the same column name is provided and creates a new column if new column name is provided. Output:

+---+----+----+-----------+
|id |name|data|startsWithA|
+---+----+----+-----------+
|1  |Mary|abcd|true       |
|2  |Joey|doge|false      |
|3  |Lane|poop|false      |
|4  |Jack|mega|false      |
|5  |Lynn|argh|true       |
+---+----+----+-----------+

Spark SQL Map only one column of DataFrame

1 Answers1