40
val columnName = Seq("col1", "col2", ....."coln")

Is there a way to use dataframe.select to get a DataFrame containing only the specified columns? I know I can call dataframe.select("col1","col2"...), but columnName is generated at runtime. I could call dataframe.select() repeatedly for each column name in a loop. Would that have any performance overhead? Is there a simpler way to accomplish this?

zero323
Himaprasoon

3 Answers

81
import org.apache.spark.sql.functions.col

val columnNames = Seq("col1","col2",....."coln")

// using the string column names:
val resultByName = dataframe.select(columnNames.head, columnNames.tail: _*)

// or, equivalently, using Column objects (needs the `col` import above):
val resultByColumn = dataframe.select(columnNames.map(c => col(c)): _*)
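To see why the `head` / `tail: _*` split is needed, here is a minimal plain-Scala sketch with no Spark dependency. `VarargsSketch` and `selectNames` are hypothetical names made up for illustration; `selectNames` only mimics the shape of a method with one mandatory `String` followed by a `String*` vararg and is not part of any Spark API.

```scala
object VarargsSketch {
  // Hypothetical stand-in shaped like select(col: String, cols: String*):
  // one mandatory name followed by zero or more extra names.
  def selectNames(first: String, rest: String*): Seq[String] = first +: rest

  def main(args: Array[String]): Unit = {
    val columnNames = Seq("col1", "col2", "col3")

    // head fills the mandatory first parameter;
    // tail: _* expands the remaining elements into the vararg parameter.
    val picked = selectNames(columnNames.head, columnNames.tail: _*)

    println(picked.mkString(", "))
  }
}
```

Note that `columnNames.head` throws on an empty sequence, so a list built at runtime should probably be checked for emptiness before using this form.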
Tzach Zohar
  • 6
    `tail` returns the sequence excluding the first item (`head`); `: _*` transforms a collection into a vararg argument - used when calling a method expecting a vararg, like select does: `def select(col: String, cols: String*)` – Tzach Zohar Mar 21 '16 at 13:15
  • 1
    These are called repeated parameters; you can read more about them [here](http://www.scala-lang.org/docu/files/ScalaReference.pdf), chapter 4, section 2. – eliasah Mar 21 '16 at 13:18
  • 1
    @V.Samma that won't compile, check the signatures of `select` - it's either `select(col: String, cols: String*): DataFrame` for Strings, or `select(cols: Column*): DataFrame` for Columns, there's no `select(cols: String*): DataFrame`. See http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset – Tzach Zohar Nov 10 '16 at 08:54
  • Is there a way to add alias for other columns like this? dataframe.select(columnNames.head, columnNames.tail: _*, col("abc").as("def")) ? – Deepak Kumar Nov 08 '22 at 17:43
8

One overload of dataFrame.select() takes a variable number of Column arguments, but we have a sequence of strings. So we first map each name to a Column with col(name), then expand the resulting sequence into varargs with `: _*`. The expression columnName.map(name => col(name)): _* can therefore be passed directly to select():

  import org.apache.spark.sql.functions.col

  val columnName = Seq("col1", "col2")
  val DFFiltered = DF.select(columnName.map(name => col(name)): _*)
JamCon
ankursingh1000
  • 1
    Please add some context and explanation to this answer. – F_SO_K May 15 '18 at 09:04
  • @UserszrKs I am using Spark 2.3.1. When I use the above, it gives the error "type mismatch: found: org.apache.spark.sql.Column, required: Seq[?]". What is wrong here? – BdEngineer Dec 10 '18 at 08:14
-1

Alternatively, you can write it like this:

val columnName = Seq("col1", "col2")
val DFFiltered = DF.select(columnName.map(DF(_)): _*)
Shubham Gupta