
I first asked my question on this page: Spark CountVectorizer return udt instead of vector

The answer was quite correct. I have another question: if you look closely at my CountVectorizer output, the format is the following: [0, 3, ...]

After checking inside my Databricks notebook, it seems that the schema of this column is the following:

`features` AS STRUCT<`type`: TINYINT, `size`: INT, `indices`: ARRAY<INT>, `values`: ARRAY<DOUBLE>>

But after checking the JavaDoc of CountVectorizer, I see no mention of that "type" field anywhere.

What is it, and how can I remove it? Because it leads me to

org.apache.spark.sql.AnalysisException: cannot resolve 'CAST(`features` AS STRUCT<`type`: TINYINT, `size`: INT, `indices`: ARRAY<INT>, `values`: ARRAY<DOUBLE>>)' due to data type mismatch: cannot cast vector to vector;

when I try to convert it to an RDD for my LDA.

1 Answer


You are confusing two different things:

  • The schema type and the external type of the column - in this case org.apache.spark.ml.linalg.SQLDataTypes.VectorType and org.apache.spark.ml.linalg.Vector, respectively.
  • The internal representation of the UserDefinedType (its sqlType).

The internal attributes of a UserDefinedType are, in general, not accessible.
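A minimal sketch of the distinction, assuming a DataFrame df whose features column was produced by CountVectorizer:

import org.apache.spark.ml.linalg.{SQLDataTypes, Vector}

// The schema type is the UDT singleton exposed through SQLDataTypes
df.schema("features").dataType == SQLDataTypes.VectorType  // true

// The external type is what you get back when collecting rows
val v: Vector = df.first().getAs[Vector]("features")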

You might be able to access the internal structure using the to_json / from_json trick, similar to what is shown here:

import org.apache.spark.sql.functions._  // from_json, to_json, struct
import org.apache.spark.sql.types._
// assumes `import spark.implicits._` is in scope for the $"..." syntax

// Schema mirroring the internal sqlType of the vector UDT:
// STRUCT<`type`: TINYINT, `size`: INT, `indices`: ARRAY<INT>, `values`: ARRAY<DOUBLE>>
val schema = StructType(Seq(StructField(
  "features",
  StructType(Seq(
    StructField("indices", ArrayType(IntegerType, true), true),
    StructField("size", IntegerType, true),
    StructField("type", ByteType, true),
    StructField("values", ArrayType(DoubleType, true), true)
  )), true)))

df.select(
  from_json(
    to_json(struct($"features")), schema
  ).getItem("features").alias("data")
)
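For the record, the type field is, as far as I can tell, the discriminator byte the VectorUDT serialization uses internally: 0 marks a SparseVector and 1 a DenseVector, which is why it never shows up in the public CountVectorizer API.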

but considering that

I try to convert it to an RDD for my LDA.

it is just a waste of time. If you're using Datasets, go with the new o.a.s.ml API, which already provides an LDA implementation. Please follow the examples in the official documentation for details: Latent Dirichlet allocation (LDA).
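As a rough sketch of that route, assuming your CountVectorizer output sits in a DataFrame named countVectorized with a features column (the column name and k value are placeholders to tune):

import org.apache.spark.ml.clustering.LDA

// Fit LDA directly on the ml Vector column; no RDD conversion needed
val lda = new LDA()
  .setK(10)                    // number of topics - an assumption, tune for your corpus
  .setMaxIter(20)
  .setFeaturesCol("features")

val model = lda.fit(countVectorized)

// Top terms per topic and per-document topic distributions
model.describeTopics(5).show(false)
model.transform(countVectorized).select("topicDistribution").show(false)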

  • Indeed, it is much easier with this API. I created a pipeline [link](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/8599738367597028/1359427237536511/3601578643761083/latest.html) and everything works now, thanks. – Vince Robatel May 28 '18 at 21:23