60

I have a Spark data frame df. Is there a way of sub-selecting a few columns using a list of these columns?

scala> df.columns
res0: Array[String] = Array("a", "b", "c", "d")

I know I can do something like df.select("b", "c"). But suppose I have a list containing a few column names, val cols = List("b", "c"). Is there a way to pass this to df.select? df.select(cols) throws an error. I am looking for something like df.select(*cols) in Python.

SARVESH
Ben

8 Answers

103

Use df.select(cols.head, cols.tail: _*)

Let me know if it works :)

Explanation from @Ben:

The key is the method signature of select:

select(col: String, cols: String*)

The `cols: String*` entry takes a variable number of arguments. `: _*` unpacks the list so its elements can be passed to that varargs parameter, very similar to unpacking in Python with `*args`. See [here](http://stackoverflow.com/a/1660768/4096199) and [here](http://stackoverflow.com/questions/6051302/what-does-colon-underscore-star-do-in-scala) for other examples.
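
For reference, a minimal self-contained sketch of the pattern (the SparkSession setup and sample data here are just illustrative assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("select-cols").getOrCreate()
import spark.implicits._

// Toy DataFrame standing in for the question's df
val df = Seq((1, 2, 3, 4)).toDF("a", "b", "c", "d")
val cols = List("b", "c")

// head fills the `col: String` parameter; tail: _* expands into the `cols: String*` varargs
val selected = df.select(cols.head, cols.tail: _*)
selected.show()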

MaxU - stand with Ukraine
Shagun Sodhani
  • Thanks! Worked like a charm. Could you explain a bit more about the syntax? Specifically, what does `cols.tail: _*` do? – Ben Jan 22 '16 at 14:12
  • I think I understand now. The key is the method signature of select `select(col: String, cols: String*)`. The `cols: String*` entry takes a variable number of arguments. `: _*` unpacks arguments so that they can be handled by this argument. Very similar to unpacking in python with `*args`. See [here](http://stackoverflow.com/a/1660768/4096199) and [here](http://stackoverflow.com/questions/6051302/what-does-colon-underscore-star-do-in-scala) for other examples. – Ben Jan 22 '16 at 14:38
  • Cool! You got it right :) Sorry I got both the notifications just now so couldn't reply earlier. :) – Shagun Sodhani Jan 22 '16 at 14:41
  • No problem. Thanks again! – Ben Jan 22 '16 at 14:42
33

You can convert each String to a Spark Column like this:

import org.apache.spark.sql.functions._
df.select(cols.map(col): _*)
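
Here `col` (from `org.apache.spark.sql.functions`) lifts each name to a `Column`; since `select` also has a `Column*` overload, `: _*` expands the mapped list into that varargs parameter.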
Kshitij Kulshrestha
25

Another option that I've just learnt.

import org.apache.spark.sql.functions.col
val columns = Seq[String]("col1", "col2", "col3")
val colNames = columns.map(name => col(name))
val selected = df.select(colNames: _*)   // avoid re-binding df to itself
vEdwardpc
3

First, convert the String array to a List of Spark Column objects, as below:

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.Column;

String[] strColNameArray = new String[]{"a", "b", "c", "d"};

List<Column> colNames = new ArrayList<>();

for (String strColName : strColNameArray) {
    colNames.add(new Column(strColName));
}

Then convert the List inside the select statement using JavaConversions. You need the following import statement:

import scala.collection.JavaConversions;

Dataset<Row> selectedDF = df.select(JavaConversions.asScalaBuffer(colNames));
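
As a side note, `scala.collection.JavaConversions` is deprecated in recent Scala versions; `scala.collection.JavaConverters` with an explicit `.asScala` conversion is the usual replacement.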
Eranga Atugoda
2

You can pass arguments of type Column* to select:

import org.apache.spark.sql.Column

val df = spark.read.json("example.json")
val cols: List[String] = List("a", "b")
// convert each string name to a Column
val col: List[Column] = cols.map(df(_))
df.select(col: _*)
raam86
2

You can do it like this:

String[] originCols = ds.columns();
ds.selectExpr(originCols)

Spark `selectExpr` source code:

  /**
   * Selects a set of SQL expressions. This is a variant of `select` that accepts
   * SQL expressions.
   *
   * {{{
   *   // The following are equivalent:
   *   ds.selectExpr("colA", "colB as newName", "abs(colC)")
   *   ds.select(expr("colA"), expr("colB as newName"), expr("abs(colC)"))
   * }}}
   *
   * @group untypedrel
   * @since 2.0.0
   */
  @scala.annotation.varargs
  def selectExpr(exprs: String*): DataFrame = {
    select(exprs.map { expr =>
      Column(sparkSession.sessionState.sqlParser.parseExpression(expr))
    }: _*)
  }
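
Applied directly to the question's Scala list, the same idea works because selectExpr takes a String varargs (a minimal sketch):

val cols = List("b", "c")
df.selectExpr(cols: _*)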
geosmart
2

Yes, you can make use of .select in Scala.

Use .head and .tail to pass all of the values in the List()

Example

val cols = List("b", "c")
df.select(cols.head, cols.tail: _*)

Explanation

`cols.head` supplies the first column name for the `col: String` parameter, and `cols.tail: _*` expands the remaining names into the `cols: String*` varargs parameter of select.

USB
  • Can you please share how to do the same (pass the column names) in Java while doing dataframeResult = inpDataframe.select("col1","col2",....) – user1326784 Mar 13 '20 at 21:11
1

Prepare a list of the required columns, then use Spark's built-in unpacking with *, as shown below.

lst = ["col1", "col2", "col3"]
result = df.select(*lst)

Sometimes you get an error like "AnalysisException: cannot resolve 'col1' given input columns". In that case, add the missing columns as string type, as shown below:

from pyspark.sql.functions import lit
from pyspark.sql.types import StringType
for i in lst:
    if i not in df.columns:
        df = df.withColumn(i, lit(None).cast(StringType()))

Finally, you will get the dataset with the required columns.

SARVESH
Hrushi
  • As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – lemon Jun 02 '22 at 14:26