
I am trying to take columns from a DataFrame and convert them to an RDD[Vector].

The problem is that I have columns with a "dot" in their name, as in the following dataset:

"col0.1","col1.2","col2.3","col3.4"
1,2,3,4
10,12,15,3
1,12,10,5

This is what I'm doing:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.mllib.linalg.{Matrix, SingularValueDecomposition, Vector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val df = spark.read.format("csv")
  .options(Map("header" -> "true", "inferSchema" -> "true"))
  .load("C:/Users/mhattabi/Desktop/donnee/test.txt")
val column = df.columns.map(c => s"`$c`")
val rows = new VectorAssembler().setInputCols(column).setOutputCol("vs")
  .transform(df)
  .select("vs")
  .rdd
val data = rows.map(_.getAs[org.apache.spark.ml.linalg.Vector](0))
  .map(org.apache.spark.mllib.linalg.Vectors.fromML)

val mat: RowMatrix = new RowMatrix(data)
// Compute the singular values and corresponding singular vectors.
val svd: SingularValueDecomposition[RowMatrix, Matrix] = mat.computeSVD(mat.numCols().toInt, computeU = true)
val U: RowMatrix = svd.U  // The U factor is a RowMatrix.
val s: Vector = svd.s  // The singular values are stored in a local dense vector.
val V: Matrix = svd.V  // The V factor is a local dense matrix.

println(V)

Any help with handling columns that have a dot in their names would be appreciated. Thanks.

– Maher HTB

3 Answers


If your problem is the . (dot) in the column name, you can use ` (backticks) to enclose the column name:

df.select("`col0.1`")

– Ricardo Mutti
  • Take care, however, that the drop function works without backticks; in this case: df.drop("col0.1") – devilpreet Aug 23 '18 at 19:28
  • @devilpreet withColumn also works fine with "dotted" names. Apparently only referencing such a column in expressions needs to be escaped with backticks. – Puterdo Borato May 13 '21 at 08:43
  • Note this also works: df.select("parent_column.`child.column.with.dots.i.dont.want.expanded`") – Wassadamo Aug 16 '21 at 21:46
  • @Wassadamo Further selection from the DataFrame using the same syntax does not work. For example I get AnalysisException: Column 'USER_DEFINED_EXTENSIONS.`UDX.EDXF.REACH`.`UDX.EDXF.REACH.LISTDATE`' does not exist. Did you mean one of the following? [USER_DEFINED_EXTENSIONS.`UDX.EDXF.REACH`.`UDX.EDXF.REACH.LISTDATE`]; notice that the column it complains about is actually in the suggested list. – soMuchToLearnAndShare Jul 20 '22 at 09:44

The problem here is the VectorAssembler implementation, not the columns per se. You can, for example, skip the header line entirely (setting the comment character to " makes Spark drop the quoted header row, and the columns get dot-free default names like _c0, _c1):

val df = spark.read.format("csv")
  .options(Map("inferSchema" -> "true", "comment" -> "\""))
  .load(path)

new VectorAssembler()
  .setInputCols(df.columns)
  .setOutputCol("vs")
  .transform(df)

or rename columns before passing to VectorAssembler:

val renamed = df.toDF(df.columns.map(_.replace(".", "_")): _*)

new VectorAssembler()
  .setInputCols(renamed.columns)
  .setOutputCol("vs")
  .transform(renamed)
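
For completeness, here is a sketch of how the rename variant feeds the rest of the question's pipeline (the val names assembled, data and mat are illustrative):

val assembled = new VectorAssembler()
  .setInputCols(renamed.columns)
  .setOutputCol("vs")
  .transform(renamed)

// Same conversion as in the question: ml Vector column -> RDD of mllib Vectors
val data = assembled.select("vs").rdd
  .map(_.getAs[org.apache.spark.ml.linalg.Vector](0))
  .map(org.apache.spark.mllib.linalg.Vectors.fromML)

val mat = new org.apache.spark.mllib.linalg.distributed.RowMatrix(data)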

Finally, the best approach is to provide the schema explicitly (with "header" -> "true" the file's header row is skipped, and the schema supplies dot-free names):

import org.apache.spark.sql.types._

val schema = StructType((0 until 4).map(i => StructField(s"_$i", DoubleType)))

val dfExplicit = spark.read.format("csv")
  .options(Map("header" -> "true"))
  .schema(schema)
  .load(path)

new VectorAssembler()
  .setInputCols(dfExplicit.columns)
  .setOutputCol("vs")
  .transform(dfExplicit)
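
With the explicit schema the columns are named _0 through _3, so nothing needs escaping downstream. A quick sanity check (expected output shown as comments, assuming the four-column file from the question):

dfExplicit.printSchema()
// root
//  |-- _0: double (nullable = true)
//  |-- _1: double (nullable = true)
//  |-- _2: double (nullable = true)
//  |-- _3: double (nullable = true)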
– zero323
  • Subject for a JIRA? – eliasah Jun 05 '17 at 11:38
  • @zero323 I appreciate your help. You mean there is no other solution that handles this while keeping the same column names? Thanks – Maher HTB Jun 05 '17 at 11:59
  • None that I am aware of, but I never looked intensively for one. In general you shouldn't use anything other than `[a-zA-Z0-9_]` in column names, to keep reasonable compatibility with other tools (including Parquet). And for `VectorAssembler` it shouldn't really matter. – zero323 Jun 05 '17 at 12:07
  • I'm not able to rename it either; it fails even when I just try to rename and show. – DINESHKUMAR MURUGAN Nov 13 '17 at 16:06
  • The `df.toDF(df.columns.map(_.replace(".", "_")): _*)` approach does not reach nested columns. I had to use an ugly trick to flatten the DataFrame (which also had to be patched because of the dots in column names), and then use the above to replace the dots after flattening. – soMuchToLearnAndShare Jul 20 '22 at 12:03

For Spark SQL:

spark.sql("select * from reg_data where reg_data.`createdResource.type` = 'Berlin'")
– s510