22

I can't figure it out, but I guess it's simple. I have a Spark dataframe df. This df has columns "A", "B" and "C". Now let's say I have an Array containing the names of the columns of this df:

column_names = Array("A","B","C")

I'd like to do a df.select() in such a way that I can specify which columns not to select. Example: let's say I do not want to select column "B". I tried

df.select(column_names.filter(_!="B"))

but this does not work, as

org.apache.spark.sql.DataFrame cannot be applied to (Array[String])

So, here it says it should work with a Seq instead. However, trying

df.select(column_names.filter(_!="B").toSeq)

results in

org.apache.spark.sql.DataFrame cannot be applied to (Seq[String]).

What am I doing wrong?

– Blaubaer

7 Answers

41

Since Spark 1.4 you can use the drop method:

Scala:

case class Point(x: Int, y: Int)
val df = sqlContext.createDataFrame(Point(0, 0) :: Point(1, 2) :: Nil)
df.drop("y")

Python:

df = sc.parallelize([(0, 0), (1, 2)]).toDF(["x", "y"])
df.drop("y")
## DataFrame[x: bigint]
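
Since Spark 2.0, drop also accepts several column names at once. A small sketch (assumes import spark.implicits._ for toDF; the column names are illustrative):

val df = Seq((0, 1, 2), (3, 4, 5)).toDF("x", "y", "z")
df.drop("y", "z")
// org.apache.spark.sql.DataFrame = [x: int]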
– zero323
10

I had the same problem and solved it this way (oaffdf is a dataframe):

val dropColNames = Seq("col7","col121")
val featColNames = oaffdf.columns.diff(dropColNames)
val featCols = featColNames.map(cn => org.apache.spark.sql.functions.col(cn))
val featsdf = oaffdf.select(featCols: _*)

https://forums.databricks.com/questions/2808/select-dataframe-columns-from-a-sequence-of-string.html
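
For reference, a self-contained sketch of the same diff-and-splat idea on a toy dataframe (assumes a SparkSession named spark; the column names are illustrative, not from the original post):

import org.apache.spark.sql.functions.col
import spark.implicits._

// toy dataframe with three columns; exclude "b"
val df = Seq((1, 2, 3), (4, 5, 6)).toDF("a", "b", "c")

val featCols = df.columns.diff(Seq("b")).map(col)
df.select(featCols: _*).show()
// only columns a and c remain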

– Edi Bice
5

OK, it's ugly, but this quick spark shell session shows something that works:

scala> val myRDD = sc.parallelize(List.range(1,10))
myRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[17] at parallelize at <console>:21

scala> val myDF = myRDD.toDF("a")
myDF: org.apache.spark.sql.DataFrame = [a: int]

scala> val myOtherRDD = sc.parallelize(List.range(1,10))
myOtherRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[20] at parallelize at <console>:21

scala> val myotherDF = myOtherRDD.toDF("b")
myotherDF: org.apache.spark.sql.DataFrame = [b: int]

scala> myDF.unionAll(myotherDF)
res2: org.apache.spark.sql.DataFrame = [a: int]

scala> myDF.join(myotherDF)
res3: org.apache.spark.sql.DataFrame = [a: int, b: int]

scala> val twocol = myDF.join(myotherDF)
twocol: org.apache.spark.sql.DataFrame = [a: int, b: int]

scala> val cols = Array("a", "b")
cols: Array[String] = Array(a, b)

scala> val selectedCols = cols.filter(_!="b")
selectedCols: Array[String] = Array(a)

scala> twocol.select(selectedCols.head, selectedCols.tail: _*)
res4: org.apache.spark.sql.DataFrame = [a: int]

Providing varargs to a function that requires them is treated in other SO questions. The signature of select is there to ensure your list of selected columns is not empty, which makes the conversion from the list of selected columns to varargs a bit more complex.
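
If you need this in more than one place, the head/tail splat can be wrapped in a small helper (the name selectExcept is hypothetical, not part of the Spark API):

import org.apache.spark.sql.DataFrame

// Hypothetical helper: select every column except the given ones.
// Assumes at least one column remains, to satisfy the
// select(col: String, cols: String*) signature.
def selectExcept(df: DataFrame, exclude: String*): DataFrame = {
  val remaining = df.columns.filterNot(exclude.contains)
  df.select(remaining.head, remaining.tail: _*)
}

selectExcept(twocol, "b") // DataFrame with column a only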

– Francois G
3

For Spark v1.4 and higher, use drop(*cols), which:

Returns a new DataFrame without the specified column(s).

Example -

df.drop('age').collect()

For Spark v2.3 and higher, you could also do it using colRegex(colName), which:

Selects column based on the column name specified as a regex and returns it as Column.

Example-

df = spark.createDataFrame([("a", 1), ("b", 2), ("c",  3)], ["Col1", "Col2"])
df.select(df.colRegex("`(Col1)?+.+`")).show()

Here the possessive group (Col1)?+ consumes "Col1" without backtracking, and .+ then requires at least one more character, so the column named exactly Col1 can never match and is excluded.

Reference - colRegex, drop


For older versions of Spark, take the list of columns in the dataframe, remove the columns you want to drop (for example with set operations), and then pass the resulting list to select, as sketched below.
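
For example, a sketch of that approach in Scala (the column name "B" is illustrative):

// set difference on the column list, then select the rest
val toDrop = Set("B")
val kept = df.columns.filterNot(toDrop.contains)
df.select(kept.head, kept.tail: _*)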

– Ani Menon
2
import org.apache.spark.sql.functions.col

val columns = Seq("A", "B", "C")

// a Seq[String] does not match select's signature directly;
// map the names to Columns and splat them as varargs
df.select(columns.diff(Seq("B")).map(col): _*)
– oluies
1

It is possible to do it with the SQL below, which uses Spark's ability to select columns using regular expressions, here with a negative lookahead expression (?!...).

In this case the dataframe has columns a, b and c, and the regex excludes column b from the selection.

Note: you need to enable regex column name lookups with the spark.sql.parser.quotedRegexColumnNames=true session setting, and this requires Spark 2.3+.

select `^(?!b).*` 
from (
   select 1 as a, 2 as b, 3 as c
)
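
For completeness, a sketch of the same query run from Scala, enabling the setting first (assumes a Spark 2.3+ SparkSession named spark):

spark.conf.set("spark.sql.parser.quotedRegexColumnNames", "true")

spark.sql("""
  select `^(?!b).*`
  from (select 1 as a, 2 as b, 3 as c)
""").show()
// only columns a and c are returned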
– Tagar
  • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - [From Review](/review/low-quality-posts/29050754) – Majid Hajibaba May 25 '21 at 11:23
  • updated answer. thanks for the feedback. my original answer was over 5 years ago so I didn't have Spark 2.3 handy at that time :) – Tagar May 25 '21 at 16:30
1

In pyspark you can do

df.select(list(set(df.columns) - set(["B"])))

(Note that building the list from a set does not preserve the original column order.)

Using more than one line you can also do

cols = df.columns
cols.remove("B")
df.select(cols)
– asmaier