
I first asked my question on this page: Spark CountVectorizer return udt instead of vector

The answer was quite correct. I have another question: if you look closely at my CountVectorizer output, the format is the following: [0, 3, ...]

After checking inside my Databricks notebook, it seems that the schema of this column is the following:

`features` AS STRUCT<`type`: TINYINT, `size`: INT, `indices`: ARRAY<INT>, `values`: ARRAY<DOUBLE>>

But after checking the JavaDoc of CountVectorizer, I see no mention of that "type" field anywhere.

What is it, and how can I remove it? Because it leads me to

org.apache.spark.sql.AnalysisException: cannot resolve 'CAST(`features` AS STRUCT<`type`: TINYINT, `size`: INT, `indices`: ARRAY<INT>, `values`: ARRAY<DOUBLE>>)' due to data type mismatch: cannot cast vector to vector;

when I try to convert it to an RDD for my LDA.

1 Answer


You are confusing two different things:

  • The schema type and the external type of the column - in this case org.apache.spark.ml.linalg.SQLDataTypes.VectorType and org.apache.spark.ml.linalg.Vector, respectively.
  • The internal representation of the UserDefinedType (its sqlType).

The internal attributes of a UserDefinedType are, in general, not accessible.
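A minimal sketch of the distinction, assuming a DataFrame df whose features column was produced by CountVectorizer:

import org.apache.spark.ml.linalg.{SQLDataTypes, Vector}

// The schema type is the UDT singleton exposed through SQLDataTypes
df.schema("features").dataType == SQLDataTypes.VectorType  // true

// The external type is what you get back when collecting rows
val v: Vector = df.first().getAs[Vector]("features")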

You might be able to access the internal structure using the to_json / from_json trick, similar to what is shown here:

import org.apache.spark.sql.functions._  // from_json, to_json, struct
import org.apache.spark.sql.types._
// assumes `import spark.implicits._` is in scope for the $"..." syntax

// Schema mirroring the internal sqlType of the vector UDT:
// STRUCT<`type`: TINYINT, `size`: INT, `indices`: ARRAY<INT>, `values`: ARRAY<DOUBLE>>
val schema = StructType(Seq(StructField(
  "features",
  StructType(Seq(
    StructField("indices", ArrayType(IntegerType, true), true),
    StructField("size", IntegerType, true),
    StructField("type", ByteType, true),
    StructField("values", ArrayType(DoubleType, true), true)
  )), true)))

df.select(
  from_json(
    to_json(struct($"features")), schema
  ).getItem("features").alias("data")
)
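For the record, the type field is, as far as I can tell, the discriminator byte the VectorUDT serialization uses internally: 0 marks a SparseVector and 1 a DenseVector, which is why it never shows up in the public CountVectorizer API.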

but considering that

I try to convert it to an RDD for my LDA.

it is just a waste of time. If you're using Datasets, go with the new o.a.s.ml API, which already provides an LDA implementation. Please follow the examples in the official documentation for details: Latent Dirichlet allocation (LDA).
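As a rough sketch of that route, assuming your CountVectorizer output sits in a DataFrame named countVectorized with a features column (the column name and k value are placeholders to tune):

import org.apache.spark.ml.clustering.LDA

// Fit LDA directly on the ml Vector column; no RDD conversion needed
val lda = new LDA()
  .setK(10)                    // number of topics - an assumption, tune for your corpus
  .setMaxIter(20)
  .setFeaturesCol("features")

val model = lda.fit(countVectorized)

// Top terms per topic and per-document topic distributions
model.describeTopics(5).show(false)
model.transform(countVectorized).select("topicDistribution").show(false)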

  • Indeed, it is much easier with this API. I created a pipeline [link](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/8599738367597028/1359427237536511/3601578643761083/latest.html) and everything works now, thanks. – Vince Robatel May 28 '18 at 21:23