I first asked my question on this page: Spark CountVectorizer return udt instead of vector
The answer there was correct, but I have a follow-up question. If you look closely at my CountVectorizer output, its format is the following: [0, 3, ...]
After checking inside my Databricks notebook, it seems that the schema of this column is the following:
`features` AS STRUCT<`type`: TINYINT, `size`: INT, `indices`: ARRAY<INT>, `values`: ARRAY<DOUBLE>>
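To make that struct concrete, here is a plain-Python mock of what one row of this schema looks like (this is an illustration only, not Spark code; the helper names are made up). The leading `type` byte appears to be a discriminator between sparse and dense vectors in Spark's vector serialization:

```python
def serialize_sparse(size, indices, values):
    # type = 0 appears to mark a sparse vector; size, indices and
    # values are all populated.
    return {"type": 0, "size": size, "indices": indices, "values": values}

def serialize_dense(values):
    # type = 1 appears to mark a dense vector; size and indices
    # are left null, only values is populated.
    return {"type": 1, "size": None, "indices": None, "values": values}

# A CountVectorizer row such as (3, [0, 1], [1.0, 2.0]) would then
# be stored in this four-field struct as:
row = serialize_sparse(3, [0, 1], [1.0, 2.0])
print(row)
# {'type': 0, 'size': 3, 'indices': [0, 1], 'values': [1.0, 2.0]}
```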
But after checking the JavaDoc of CountVectorizer, I see no mention of that `type` field anywhere.
What is it, and how can I remove it? It causes the following error:
org.apache.spark.sql.AnalysisException: cannot resolve 'CAST(`features` AS STRUCT<`type`: TINYINT, `size`: INT, `indices`: ARRAY<INT>, `values`: ARRAY<DOUBLE>>)' due to data type mismatch: cannot cast vector to vector;
when I try to convert the column to an RDD for my LDA.