60

I have a Spark data frame df. Is there a way of sub-selecting a few columns using a list of these columns?

scala> df.columns
res0: Array[String] = Array("a", "b", "c", "d")

I know I can do something like df.select("b", "c"). But suppose I have a list containing a few column names, val cols = List("b", "c"). Is there a way to pass this to df.select? df.select(cols) throws an error. I am looking for something like df.select(*cols) in Python.

SARVESH
Ben

8 Answers

103

Use df.select(cols.head, cols.tail: _*)

Let me know if it works :)

Explanation from @Ben:

The key is the method signature of select:

select(col: String, cols: String*)

The `cols: String*` entry takes a variable number of arguments. `: _*` unpacks the list so its elements can be passed to that varargs parameter, very similar to unpacking in Python with `*args`. See [here](http://stackoverflow.com/a/1660768/4096199) and [here](http://stackoverflow.com/questions/6051302/what-does-colon-underscore-star-do-in-scala) for other examples.
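
For reference, a minimal self-contained sketch of the pattern (the SparkSession setup and sample data here are just illustrative assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("select-cols").getOrCreate()
import spark.implicits._

// Toy DataFrame standing in for the question's df
val df = Seq((1, 2, 3, 4)).toDF("a", "b", "c", "d")
val cols = List("b", "c")

// head fills the `col: String` parameter; tail: _* expands into the `cols: String*` varargs
val selected = df.select(cols.head, cols.tail: _*)
selected.show()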

MaxU - stand with Ukraine
Shagun Sodhani
  • Thanks! Worked like a charm. Could you explain a bit more about the syntax? Specifically, what does `cols.tail: _*` do? – Ben Jan 22 '16 at 14:12
  • I think I understand now. The key is the method signature of select `select(col: String, cols: String*)`. The `cols: String*` entry takes a variable number of arguments. `: _*` unpacks arguments so that they can be handled by this argument. Very similar to unpacking in python with `*args`. See [here](http://stackoverflow.com/a/1660768/4096199) and [here](http://stackoverflow.com/questions/6051302/what-does-colon-underscore-star-do-in-scala) for other examples. – Ben Jan 22 '16 at 14:38
  • Cool! You got it right :) Sorry I got both the notifications just now so couldn't reply earlier. :) – Shagun Sodhani Jan 22 '16 at 14:41
  • No problem. Thanks again! – Ben Jan 22 '16 at 14:42
33

You can convert each String to a Spark Column like this:

import org.apache.spark.sql.functions._
df.select(cols.map(col): _*)
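
Here `col` (from `org.apache.spark.sql.functions`) lifts each name to a `Column`; since `select` also has a `Column*` overload, `: _*` expands the mapped list into that varargs parameter.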
Kshitij Kulshrestha
25

Another option that I've just learnt.

import org.apache.spark.sql.functions.col
val columns = Seq[String]("col1", "col2", "col3")
val colNames = columns.map(name => col(name))
val selected = df.select(colNames: _*)   // avoid re-binding df to itself
vEdwardpc
3

First, convert the String array to a List of Spark Column objects, as below:

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.Column;

String[] strColNameArray = new String[]{"a", "b", "c", "d"};

List<Column> colNames = new ArrayList<>();

for (String strColName : strColNameArray) {
    colNames.add(new Column(strColName));
}

Then convert the List inside the select statement using JavaConversions. You need the following import statement:

import scala.collection.JavaConversions;

Dataset<Row> selectedDF = df.select(JavaConversions.asScalaBuffer(colNames));
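
As a side note, `scala.collection.JavaConversions` is deprecated in recent Scala versions; `scala.collection.JavaConverters` with an explicit `.asScala` conversion is the usual replacement.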
Eranga Atugoda
2

You can pass arguments of type Column* to select:

import org.apache.spark.sql.Column

val df = spark.read.json("example.json")
val cols: List[String] = List("a", "b")
// convert each string name to a Column
val col: List[Column] = cols.map(df(_))
df.select(col: _*)
raam86
2

You can do it like this:

String[] originCols = ds.columns();
ds.selectExpr(originCols)

Spark `selectExpr` source code:

  /**
   * Selects a set of SQL expressions. This is a variant of `select` that accepts
   * SQL expressions.
   *
   * {{{
   *   // The following are equivalent:
   *   ds.selectExpr("colA", "colB as newName", "abs(colC)")
   *   ds.select(expr("colA"), expr("colB as newName"), expr("abs(colC)"))
   * }}}
   *
   * @group untypedrel
   * @since 2.0.0
   */
  @scala.annotation.varargs
  def selectExpr(exprs: String*): DataFrame = {
    select(exprs.map { expr =>
      Column(sparkSession.sessionState.sqlParser.parseExpression(expr))
    }: _*)
  }
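
Applied directly to the question's Scala list, the same idea works because selectExpr takes a String varargs (a minimal sketch):

val cols = List("b", "c")
df.selectExpr(cols: _*)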
geosmart
2

Yes, you can make use of .select in Scala.

Use .head and .tail to pass all of the values in the List()

Example

val cols = List("b", "c")
df.select(cols.head, cols.tail: _*)

Explanation

`cols.head` supplies the first column name for the `col: String` parameter, and `cols.tail: _*` expands the remaining names into the `cols: String*` varargs parameter of select.

USB
  • Can you please share how to do the same (pass the column names) in Java while doing dataframeResult = inpDataframe.select("col1","col2",....) – user1326784 Mar 13 '20 at 21:11
1

Prepare a list of the required columns, then use Spark's built-in unpacking with *, as shown below.

lst = ["col1", "col2", "col3"]
result = df.select(*lst)

Sometimes you get an error like "AnalysisException: cannot resolve 'col1' given input columns". In that case, add the missing columns as string type, as shown below:

from pyspark.sql.functions import lit
from pyspark.sql.types import StringType
for i in lst:
    if i not in df.columns:
        df = df.withColumn(i, lit(None).cast(StringType()))

Finally, you will get the dataset with the required columns.

SARVESH
Hrushi
  • As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – lemon Jun 02 '22 at 14:26