
I'm trying to drop a column from a DataFrame, but I have column names with dots in them, which I escaped.

Before I escape, my schema looks like this:

root
 |-- user_id: long (nullable = true)
 |-- hourOfWeek: string (nullable = true)
 |-- observed: string (nullable = true)
 |-- raw.hourOfDay: long (nullable = true)
 |-- raw.minOfDay: long (nullable = true)
 |-- raw.dayOfWeek: long (nullable = true)
 |-- raw.sensor2: long (nullable = true)

If I try to drop a column, I get:

df = df.drop("hourOfWeek")
org.apache.spark.sql.AnalysisException: cannot resolve 'raw.hourOfDay' given input columns raw.dayOfWeek, raw.sensor2, observed, raw.hourOfDay, hourOfWeek, raw.minOfDay, user_id;
        at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
        at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)

Note that I'm not even trying to drop the columns with dots in their names. Since I couldn't seem to do much without escaping the column names, I converted the schema to:

root
 |-- user_id: long (nullable = true)
 |-- hourOfWeek: string (nullable = true)
 |-- observed: string (nullable = true)
 |-- `raw.hourOfDay`: long (nullable = true)
 |-- `raw.minOfDay`: long (nullable = true)
 |-- `raw.dayOfWeek`: long (nullable = true)
 |-- `raw.sensor2`: long (nullable = true)

but that doesn't seem to help. I still get the same error.
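
For illustration, one way to produce such a renaming (a sketch, assuming a plain toDF over backtick-wrapped names; the exact code may vary):

// Sketch: rename every column to a backtick-wrapped copy of itself.
// toDF takes the new names positionally, so column order is preserved.
val escaped = df.toDF(df.columns.map(c => s"`$c`"): _*)
escaped.printSchema()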

I tried escaping all column names and dropping using the escaped name, but that doesn't work either:

root
 |-- `user_id`: long (nullable = true)
 |-- `hourOfWeek`: string (nullable = true)
 |-- `observed`: string (nullable = true)
 |-- `raw.hourOfDay`: long (nullable = true)
 |-- `raw.minOfDay`: long (nullable = true)
 |-- `raw.dayOfWeek`: long (nullable = true)
 |-- `raw.sensor2`: long (nullable = true)

df.drop("`hourOfWeek`")
org.apache.spark.sql.AnalysisException: cannot resolve 'user_id' given input columns `user_id`, `raw.dayOfWeek`, `observed`, `raw.minOfDay`, `raw.hourOfDay`, `raw.sensor2`, `hourOfWeek`;
        at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)

Is there another way to drop a column that would not fail on this type of data?

MrE

2 Answers


Alright, I seem to have found the solution after all:

df.drop(df.col("raw.hourOfWeek")) seems to work

MrE
  • Useful answer. But I have one more, similar question: assume I have around 100 columns in a Spark DataFrame. Is there any way to select only a few columns from this DataFrame and create another DataFrame with those selected columns? Something like df2 = df1.select(df.col("col1", "col2")) – JKC Sep 05 '17 at 05:46
  • I think this https://stackoverflow.com/questions/36131716/scala-spark-dataframe-dataframe-select-multiple-columns-given-a-sequence-of-co answers your question; the pattern is sketched below – MrE Sep 05 '17 at 23:38
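
In brief, the pattern from that linked question (a sketch; df1 and wanted are hypothetical placeholders):

import org.apache.spark.sql.functions.col

// Map each wanted name to a Column and splat the sequence into select.
val wanted = Seq("col1", "col2")
val df2 = df1.select(wanted.map(name => col(name)): _*)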
val data = df.drop("Customers")

works fine for ordinary column names. For a column whose name contains a dot, pass a Column instead:

val result = df.drop(df.col("old.column"))
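
The likely reason the Column overload behaves better: drop(colName: String) is implemented as a re-select of the remaining columns by name, so every name (dotted ones included) goes back through the column parser, where a dot means struct-field access. drop(col: Column) (added in Spark 1.6, if memory serves) filters on the already-resolved attribute instead, so no names are re-parsed.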
marc_s