
I am trying to take columns from a DataFrame and convert them to an RDD[Vector].

The problem is that I have columns with a "dot" in their name, as in the following dataset:

"col0.1","col1.2","col2.3","col3.4"
1,2,3,4
10,12,15,3
1,12,10,5

This is what I'm doing:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.mllib.linalg.{Matrix, SingularValueDecomposition, Vector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val df = spark.read.format("csv")
  .options(Map("header" -> "true", "inferSchema" -> "true"))
  .load("C:/Users/mhattabi/Desktop/donnee/test.txt")
val column = df.columns.map(c => s"`$c`")
val rows = new VectorAssembler().setInputCols(column).setOutputCol("vs")
  .transform(df)
  .select("vs")
  .rdd
val data = rows.map(_.getAs[org.apache.spark.ml.linalg.Vector](0))
  .map(org.apache.spark.mllib.linalg.Vectors.fromML)

val mat: RowMatrix = new RowMatrix(data)
// Compute the singular values and corresponding singular vectors.
val svd: SingularValueDecomposition[RowMatrix, Matrix] = mat.computeSVD(mat.numCols().toInt, computeU = true)
val U: RowMatrix = svd.U  // The U factor is a RowMatrix.
val s: Vector = svd.s  // The singular values are stored in a local dense vector.
val V: Matrix = svd.V  // The V factor is a local dense matrix.

println(V)

Any help with handling columns that have a dot in their names would be appreciated. Thanks.

– Maher HTB

3 Answers


If your problem is the . (dot) in the column name, you can use ` (backticks) to enclose the column name:

df.select("`col0.1`")

– Ricardo Mutti
  • Take care, however, that the drop function works without backticks; in this case: df.drop("col0.1") – devilpreet Aug 23 '18 at 19:28
  • @devilpreet withColumn also works fine with "dotted" names. Apparently only referencing such a column in expressions needs to be escaped with backticks. – Puterdo Borato May 13 '21 at 08:43
  • Note this also works: df.select("parent_column.`child.column.with.dots.i.dont.want.expanded`") – Wassadamo Aug 16 '21 at 21:46
  • @Wassadamo Further selection from the DataFrame using the same syntax does not work. For example I get AnalysisException: Column 'USER_DEFINED_EXTENSIONS.`UDX.EDXF.REACH`.`UDX.EDXF.REACH.LISTDATE`' does not exist. Did you mean one of the following? [USER_DEFINED_EXTENSIONS.`UDX.EDXF.REACH`.`UDX.EDXF.REACH.LISTDATE`]; notice that the column it complains about is actually in the suggested list. – soMuchToLearnAndShare Jul 20 '22 at 09:44

The problem here is the VectorAssembler implementation, not the columns per se. You can, for example, skip the header line entirely (setting the comment character to " makes Spark drop the quoted header row, and the columns get dot-free default names like _c0, _c1):

val df = spark.read.format("csv")
  .options(Map("inferSchema" -> "true", "comment" -> "\""))
  .load(path)

new VectorAssembler()
  .setInputCols(df.columns)
  .setOutputCol("vs")
  .transform(df)

or rename columns before passing to VectorAssembler:

val renamed = df.toDF(df.columns.map(_.replace(".", "_")): _*)

new VectorAssembler()
  .setInputCols(renamed.columns)
  .setOutputCol("vs")
  .transform(renamed)
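
For completeness, here is a sketch of how the rename variant feeds the rest of the question's pipeline (the val names assembled, data and mat are illustrative):

val assembled = new VectorAssembler()
  .setInputCols(renamed.columns)
  .setOutputCol("vs")
  .transform(renamed)

// Same conversion as in the question: ml Vector column -> RDD of mllib Vectors
val data = assembled.select("vs").rdd
  .map(_.getAs[org.apache.spark.ml.linalg.Vector](0))
  .map(org.apache.spark.mllib.linalg.Vectors.fromML)

val mat = new org.apache.spark.mllib.linalg.distributed.RowMatrix(data)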

Finally, the best approach is to provide the schema explicitly (with "header" -> "true" the file's header row is skipped, and the schema supplies dot-free names):

import org.apache.spark.sql.types._

val schema = StructType((0 until 4).map(i => StructField(s"_$i", DoubleType)))

val dfExplicit = spark.read.format("csv")
  .options(Map("header" -> "true"))
  .schema(schema)
  .load(path)

new VectorAssembler()
  .setInputCols(dfExplicit.columns)
  .setOutputCol("vs")
  .transform(dfExplicit)
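
With the explicit schema the columns are named _0 through _3, so nothing needs escaping downstream. A quick sanity check (expected output shown as comments, assuming the four-column file from the question):

dfExplicit.printSchema()
// root
//  |-- _0: double (nullable = true)
//  |-- _1: double (nullable = true)
//  |-- _2: double (nullable = true)
//  |-- _3: double (nullable = true)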
– zero323
  • Subject for a JIRA? – eliasah Jun 05 '17 at 11:38
  • @zero323 I appreciate your help. You mean there is no other solution that handles this while keeping the same column names? Thanks – Maher HTB Jun 05 '17 at 11:59
  • None that I am aware of, but I never looked intensively for one. In general you shouldn't use anything other than `[a-zA-Z0-9_]` in column names, to keep reasonable compatibility with other tools (including Parquet). And for `VectorAssembler` it shouldn't really matter. – zero323 Jun 05 '17 at 12:07
  • I'm not able to rename it either; it fails even when I just try to rename and show. – DINESHKUMAR MURUGAN Nov 13 '17 at 16:06
  • The `df.toDF(df.columns.map(_.replace(".", "_")): _*)` approach does not reach nested columns. I had to use an ugly trick to flatten the DataFrame (which also had to be patched because of the dots in column names), and then use the above to replace the dots after flattening. – soMuchToLearnAndShare Jul 20 '22 at 12:03

For Spark SQL:

spark.sql("select * from reg_data where reg_data.`createdResource.type` = 'Berlin'")
– s510