How can I declare that a given Column in my DataFrame contains categorical information?
I have a Spark SQL DataFrame which I loaded from a database. Many of its columns contain categorical information, but the values are encoded as Longs (for privacy).
I want to be able to tell spark-ml that even though a column is numeric, the information it carries is actually categorical. The category indexes may have a few holes, which is acceptable (e.g., a column may hold the values [1, 0, 0, 4]).
I am aware that StringIndexer exists, but I would prefer to avoid the hassle of encoding and decoding, especially because I have many columns that behave this way.
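For reference, this is the kind of per-column boilerplate I am trying to avoid (a sketch; as far as I know StringIndexer casts numeric input to string before indexing, so each Long column would need its own indexer, plus an IndexToString afterwards to decode predictions):

from pyspark.ml.feature import StringIndexer

## One StringIndexer stage per categorical column --
## this is the hassle I want to skip.
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx")
            for c in ["CategoricalColOfLongs1", "CategoricalColOfLongs2"]]

With many such columns this multiplies the pipeline stages and renames every feature.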
I am looking for something like the following:
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import VectorAssembler

train = load_from_database()

categorical_cols = ["CategoricalColOfLongs1",
                    "CategoricalColOfLongs2"]
numeric_cols = ["NumericColOfLongs1"]

## This is what I am looking for:
## a step that detects the min and max value of each column
## and adds metadata marking it as a categorical column
## with (1 + max - min) categories
categorizer = ColumnCategorizer(columns=categorical_cols,
                                autoDetectMinMax=True)

vectorizer = VectorAssembler(inputCols=categorical_cols + numeric_cols,
                             outputCol="features")
classifier = DecisionTreeClassifier()
pipeline = Pipeline(stages=[categorizer, vectorizer, classifier])
model = pipeline.fit(train)
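If no such transformer exists, I assume the equivalent would be to attach Spark ML attribute metadata to each column by hand. Here is my rough, untested sketch of that idea; it assumes Spark 2.2+, where Column.alias accepts a metadata keyword argument, and the internal "ml_attr" nominal format that StringIndexer writes (mark_as_categorical is just a name I made up):

from pyspark.sql.functions import col

def mark_as_categorical(df, column, num_categories):
    ## Nominal-attribute metadata; "num_vals" would be the category
    ## count, i.e. (1 + max - min) after auto-detection. I am not
    ## sure this metadata format is a supported public API.
    meta = {"ml_attr": {"type": "nominal", "num_vals": num_categories}}
    return df.withColumn(column, col(column).alias(column, metadata=meta))

## e.g. values [1, 0, 0, 4] -> 5 categories (0 through 4)
train = mark_as_categorical(train, "CategoricalColOfLongs1", 5)

My understanding is that VectorAssembler carries such per-column metadata into the assembled features vector, which is what DecisionTreeClassifier inspects, but I would prefer a ready-made transformer if one exists.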