I'm just after a way to convert all the String-type columns in my PySpark DataFrame to categorical variables so I can run a decision tree on it. I can't use pandas and can only use PySpark libraries due to resource constraints. I've identified VectorIndexer as a possible solution; however, I don't understand how to use it to convert all the String-type columns, which the documentation says is possible.
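To make it concrete, here's a toy stand-in for the kind of DataFrame I'm working with (the column names and values are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A mix of string columns and a numeric label, similar in shape to my real data
df = spark.createDataFrame(
    [("red", "small", 1.0), ("blue", "large", 0.0), ("red", "large", 1.0)],
    ["colour", "size", "label"],
)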
Could somebody help me with the syntax for doing that? I'm after something like this:
featureIndexer = VectorIndexer(inputCol=<list of input columns>, outputCol=<list of output columns>, maxCategories=10).fit(df)
or letting VectorIndexer figure out on its own which columns need indexing, which the documentation seems to indicate it can do:
featureIndexer = VectorIndexer(df, maxCategories=10).fit(df)
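For reference, the closest thing I can find in the documentation is the pattern below, which seems to assume the features have already been assembled into a single vector column (the column names here are just placeholders):

from pyspark.ml.feature import VectorIndexer

# Documented pattern: index one existing vector column, not a list of String columns
featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=10)
indexerModel = featureIndexer.fit(df)
indexedDf = indexerModel.transform(df)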
Thanks in advance.