
I have a dataframe with a large, variable number of columns (on the order of 1000).

I want to convert all values to upper case.

Here is the approach I have thought of; can you suggest whether this is the best way?

  • Take a row.
  • Find the schema, store it in an array, and count how many fields there are.
  • Map over each row in the dataframe, up to the number of elements in the array.
  • Apply a function to upper-case each field and return the row.
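For what it's worth, the steps above boil down to mapping an upper-casing function over every field of every row. Stripped of Spark entirely, the idea looks like this in plain Python (toy data, just to illustrate the row-wise version being described):

```python
# Toy stand-in for a DataFrame: a list of rows, each row a tuple of strings.
rows = [("a", "B", "c"), ("D", "e", "F")]

# "Apply a function to upper-case each field and return the row":
upper_rows = [tuple(field.upper() for field in row) for row in rows]

print(upper_rows)  # [('A', 'B', 'C'), ('D', 'E', 'F')]
```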
zero323
user2230605

2 Answers


If you simply want to apply the same function to all columns, something like this should be enough:

import org.apache.spark.sql.functions.{col, upper}

val df = sc.parallelize(
  Seq(("a", "B", "c"), ("D", "e", "F"))).toDF("x", "y", "z")
df.select(df.columns.map(c => upper(col(c)).alias(c)): _*).show

// +---+---+---+
// |  x|  y|  z|
// +---+---+---+
// |  A|  B|  C|
// |  D|  E|  F|
// +---+---+---+

or in Python

from pyspark.sql.functions import col, upper

df = sc.parallelize([("a", "B", "c"), ("D", "e", "F")]).toDF(("x", "y", "z"))
df.select(*(upper(col(c)).alias(c) for c in df.columns)).show()

##  +---+---+---+
##  |  x|  y|  z|
##  +---+---+---+
##  |  A|  B|  C|
##  |  D|  E|  F|
##  +---+---+---+

See also: SparkSQL: apply aggregate functions to a list of columns

  • Thanks. What is `.alias(c)): _*` doing, in English? – user2230605 Dec 02 '15 at 09:20
  • `alias` sets a name for the column. `: _*` denotes varargs syntax in Scala; in other words, it passes each element of the sequence as an argument to `select`. – zero323 Dec 02 '15 at 09:27
  • Getting this error: ```File "", line 1 pivoted.select(pivoted.columns.map(c => encodeUDF(col(c)).alias(c)): _*).show(2) ^ SyntaxError: invalid syntax``` – Tracy Jun 01 '20 at 13:59
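To expand on the comments: `: _*` in the Scala answer plays the same role as the `*` in the Python answer's `select(*(...))`; both unpack a sequence into separate arguments (which is also why pasting the Scala line into PySpark raises a `SyntaxError`). A minimal illustration in plain Python, where `select` is a hypothetical stand-in rather than the real DataFrame method:

```python
def select(*cols):
    # Hypothetical stand-in for DataFrame.select: just collects its arguments.
    return cols

names = ["x", "y", "z"]

# The * unpacks the generator, so select receives three arguments, not one.
result = select(*(c.upper() for c in names))
print(result)  # ('X', 'Y', 'Z')
```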

I needed to do something similar, but I had to write my own function to convert empty strings within a dataframe to null. This is what I did.

import org.apache.spark.sql.functions.{col, udf} 
import spark.implicits._ 

def emptyToNull(_str: String): Option[String] = {
  _str match {
    case _ if _str == null || _str.trim.isEmpty => None
    case _ => Some(_str)
  }
}
val emptyToNullUdf = udf(emptyToNull(_: String))

val df = Seq(("a", "B", "c"), ("D", "e ", ""), ("", "", null)).toDF("x", "y", "z")
df.select(df.columns.map(c => emptyToNullUdf(col(c)).alias(c)): _*).show

+----+----+----+
|   x|   y|   z|
+----+----+----+
|   a|   B|   c|
|   D|  e |null|
|null|null|null|
+----+----+----+

Here's a more refined version of `emptyToNull` that uses `Option` instead of checking for null directly.

def emptyToNull(_str: String): Option[String] = Option(_str) match {
  case ret @ Some(s) if (s.trim.nonEmpty) => ret
  case _ => None
}
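Outside Spark, the same empty-to-null normalization can be sketched in plain Python, with `None` standing in for SQL null (the function name here is mine, not from the answer):

```python
def empty_to_null(s):
    # Map None and whitespace-only strings to None; pass everything else through.
    return s if s is not None and s.strip() else None

print(empty_to_null("D"))    # D
print(empty_to_null("   "))  # None
print(empty_to_null(None))   # None
```

Note that, as in the Scala version, a string with trailing whitespace such as `"e "` is kept as-is; only all-whitespace strings become null.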
Pika Supports Ukraine
Tony Fraser