22

I can't figure it out, but I guess it's simple. I have a Spark dataframe df. This df has columns "A", "B" and "C". Now let's say I have an Array containing the names of the columns of this df:

column_names = Array("A","B","C")

I'd like to do a df.select() in such a way that I can specify which columns not to select. Example: let's say I do not want to select column "B". I tried

df.select(column_names.filter(_!="B"))

but this does not work, as

org.apache.spark.sql.DataFrame cannot be applied to (Array[String])

So, here it says it should work with a Seq instead. However, trying

df.select(column_names.filter(_!="B").toSeq)

results in

org.apache.spark.sql.DataFrame cannot be applied to (Seq[String]).

What am I doing wrong?

– Blaubaer

7 Answers

41

Since Spark 1.4 you can use the drop method:

Scala:

case class Point(x: Int, y: Int)
val df = sqlContext.createDataFrame(Point(0, 0) :: Point(1, 2) :: Nil)
df.drop("y")

Python:

df = sc.parallelize([(0, 0), (1, 2)]).toDF(["x", "y"])
df.drop("y")
## DataFrame[x: bigint]
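
Since Spark 2.0, drop also accepts several column names at once. A small sketch (assumes import spark.implicits._ for toDF; the column names are illustrative):

val df = Seq((0, 1, 2), (3, 4, 5)).toDF("x", "y", "z")
df.drop("y", "z")
// org.apache.spark.sql.DataFrame = [x: int]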
– zero323
10

I had the same problem and solved it this way (oaffdf is a dataframe):

val dropColNames = Seq("col7","col121")
val featColNames = oaffdf.columns.diff(dropColNames)
val featCols = featColNames.map(cn => org.apache.spark.sql.functions.col(cn))
val featsdf = oaffdf.select(featCols: _*)

https://forums.databricks.com/questions/2808/select-dataframe-columns-from-a-sequence-of-string.html
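
For reference, a self-contained sketch of the same diff-and-splat idea on a toy dataframe (assumes a SparkSession named spark; the column names are illustrative, not from the original post):

import org.apache.spark.sql.functions.col
import spark.implicits._

// toy dataframe with three columns; exclude "b"
val df = Seq((1, 2, 3), (4, 5, 6)).toDF("a", "b", "c")

val featCols = df.columns.diff(Seq("b")).map(col)
df.select(featCols: _*).show()
// only columns a and c remain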

– Edi Bice
5

OK, it's ugly, but this quick spark shell session shows something that works:

scala> val myRDD = sc.parallelize(List.range(1,10))
myRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[17] at parallelize at <console>:21

scala> val myDF = myRDD.toDF("a")
myDF: org.apache.spark.sql.DataFrame = [a: int]

scala> val myOtherRDD = sc.parallelize(List.range(1,10))
myOtherRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[20] at parallelize at <console>:21

scala> val myotherDF = myOtherRDD.toDF("b")
myotherDF: org.apache.spark.sql.DataFrame = [b: int]

scala> myDF.unionAll(myotherDF)
res2: org.apache.spark.sql.DataFrame = [a: int]

scala> myDF.join(myotherDF)
res3: org.apache.spark.sql.DataFrame = [a: int, b: int]

scala> val twocol = myDF.join(myotherDF)
twocol: org.apache.spark.sql.DataFrame = [a: int, b: int]

scala> val cols = Array("a", "b")
cols: Array[String] = Array(a, b)

scala> val selectedCols = cols.filter(_!="b")
selectedCols: Array[String] = Array(a)

scala> twocol.select(selectedCols.head, selectedCols.tail: _*)
res4: org.apache.spark.sql.DataFrame = [a: int]

Providing varargs to a function that requires them is treated in other SO questions. The signature of select is there to ensure your list of selected columns is not empty, which makes the conversion from the list of selected columns to varargs a bit more complex.
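
If you need this in more than one place, the head/tail splat can be wrapped in a small helper (the name selectExcept is hypothetical, not part of the Spark API):

import org.apache.spark.sql.DataFrame

// Hypothetical helper: select every column except the given ones.
// Assumes at least one column remains, to satisfy the
// select(col: String, cols: String*) signature.
def selectExcept(df: DataFrame, exclude: String*): DataFrame = {
  val remaining = df.columns.filterNot(exclude.contains)
  df.select(remaining.head, remaining.tail: _*)
}

selectExcept(twocol, "b") // DataFrame with column a only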

– Francois G
3

For Spark v1.4 and higher, use drop(*cols), which:

Returns a new DataFrame without the specified column(s).

Example -

df.drop('age').collect()

For Spark v2.3 and higher, you could also do it using colRegex(colName), which:

Selects column based on the column name specified as a regex and returns it as Column.

Example-

df = spark.createDataFrame([("a", 1), ("b", 2), ("c",  3)], ["Col1", "Col2"])
df.select(df.colRegex("`(Col1)?+.+`")).show()

Here the possessive group (Col1)?+ consumes "Col1" without backtracking, and .+ then requires at least one more character, so the column named exactly Col1 can never match and is excluded.

Reference - colRegex, drop


For older versions of Spark, take the list of columns in the dataframe, remove the columns you want to drop (for example with set operations), and then pass the resulting list to select, as sketched below.
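
For example, a sketch of that approach in Scala (the column name "B" is illustrative):

// set difference on the column list, then select the rest
val toDrop = Set("B")
val kept = df.columns.filterNot(toDrop.contains)
df.select(kept.head, kept.tail: _*)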

– Ani Menon
2
import org.apache.spark.sql.functions.col

val columns = Seq("A", "B", "C")

// a Seq[String] does not match select's signature directly;
// map the names to Columns and splat them as varargs
df.select(columns.diff(Seq("B")).map(col): _*)
– oluies
1

It is possible to do it with the SQL below, which uses Spark's ability to select columns using regular expressions, here with a negative lookahead expression (?!...).

In this case the dataframe has columns a, b and c, and the regex excludes column b from the selection.

Note: you need to enable regex column name lookups with the spark.sql.parser.quotedRegexColumnNames=true session setting, and this requires Spark 2.3+.

select `^(?!b).*` 
from (
   select 1 as a, 2 as b, 3 as c
)
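
For completeness, a sketch of the same query run from Scala, enabling the setting first (assumes a Spark 2.3+ SparkSession named spark):

spark.conf.set("spark.sql.parser.quotedRegexColumnNames", "true")

spark.sql("""
  select `^(?!b).*`
  from (select 1 as a, 2 as b, 3 as c)
""").show()
// only columns a and c are returned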
– Tagar
  • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - [From Review](/review/low-quality-posts/29050754) – Majid Hajibaba May 25 '21 at 11:23
  • updated answer. thanks for the feedback. my original answer was over 5 years ago so I didn't have Spark 2.3 handy at that time :) – Tagar May 25 '21 at 16:30
1

In pyspark you can do

df.select(list(set(df.columns) - set(["B"])))

(Note that building the list from a set does not preserve the original column order.)

Using more than one line you can also do

cols = df.columns
cols.remove("B")
df.select(cols)
– asmaier