I want to make my own transformer of features in a DataFrame, so that I add a column which is, for example, the difference between two other columns. I followed this question, but the transformer there operates on one column only. pyspark.ml.Transformer takes a string as the argument for inputCol, so of course I cannot specify multiple columns.

So basically, what I want to achieve is a _transform() method that resembles this one:

    def _transform(self, dataset):
        out_col = self.getOutputCol()
        in_col = dataset.select([self.getInputCol()])

        # Define transformer logic
        def f(col1, col2):
            return col1 - col2

        t = IntegerType()
        return dataset.withColumn(out_col, udf(f, t)(in_col))

How can this be done with more than one input column?