4

Sorry for the noob question, I have a dataframe in SparkSQL like this:

id | name | data
----------------
1  | Mary | ABCD
2  | Joey | DOGE
3  | Lane | POOP
4  | Jack | MEGA
5  | Lynn | ARGH

I want to know how to do two things:

1) use a scala function on one or more columns to produce another column 2) use a scala function on one or more columns to replace a column

Examples:

1) Create a new boolean column that tells whether the data starts with A:

id | name | data | startsWithA
------------------------------
1  | Mary | ABCD |        true
2  | Joey | DOGE |       false
3  | Lane | POOP |       false
4  | Jack | MEGA |       false
5  | Lynn | ARGH |        true

2) Replace the data column with its lowercase counterpart:

id | name | data
----------------
1  | Mary | abcd
2  | Joey | doge
3  | Lane | poop
4  | Jack | mega
5  | Lynn | argh

What is the best way to do this in SparkSQL? I've seen many examples of how to return a single transformed column, but I don't know how to get back a new DataFrame with all the original columns as well.

user3685285
  • 6,066
  • 13
  • 54
  • 95
  • 1
    You can use `withColumn` to add columns and keep the existing ones. For examples, see [this post](https://stackoverflow.com/questions/40959655/adding-two-columns-to-existing-dataframe-using-withcolumn). If you use the same name as an existing column, it will overwrite it. – pault May 08 '18 at 16:32
  • 1
    In pyspark you could do this via: `from pyspark.sql.functions import col, lower; df = df.withColumn('startsWithA', col('data').startswith('A')).withColumn('data', lower(col('data')))`. Should be straightforward to translate into scala. – pault May 08 '18 at 16:39
  • @pault Thanks, but do you know how to use this to replace a certain column? Say I wanted to transform my uppercase data into lowercase as in example 2. – user3685285 May 08 '18 at 16:39
  • Or you can use spark-sql, something like: `"SELECT id, name, LOWER(data) AS data, data LIKE 'A%' AS startsWithA FROM myTable"` – pault May 08 '18 at 16:42

1 Answers1

4

You can use withColumn to add new column or to replace the existing column as

val df = Seq(
 (1, "Mary", "ABCD"),
 (2, "Joey", "DOGE"),
 (3, "Lane", "POOP"),
 (4, "Jack", "MEGA"),
 (5, "Lynn", "ARGH")
).toDF("id", "name", "data")


val resultDF = df.withColumn("startsWithA", $"data".startsWith("A"))
  .withColumn("data", lower($"data"))

If you want separate dataframe then

val resultDF1 = df.withColumn("startsWithA", $"data".startsWith("A"))
val resultDF2 = df.withColumn("data", lower($"data"))

withColumn replaces the old column if the same column name is provided and creates a new column if new column name is provided. Output:

+---+----+----+-----------+
|id |name|data|startsWithA|
+---+----+----+-----------+
|1  |Mary|abcd|true       |
|2  |Joey|doge|false      |
|3  |Lane|poop|false      |
|4  |Jack|mega|false      |
|5  |Lynn|argh|true       |
+---+----+----+-----------+
koiralo
  • 22,594
  • 6
  • 51
  • 72