
I wrote a custom SQLTransformer in PySpark, where setting a default SQL statement is mandatory for the code to execute at all. I can save the custom transformer in Python, then load and run it from Scala and/or Python, but only the default statement is executed, even though the `_transform` method does something else. I get the same result in both languages, so the problem is not related to the `_to_java` method or the `JavaTransformer` class.

from pyspark.ml.feature import SQLTransformer

class filter(SQLTransformer):
    def __init__(self):
        super(filter, self).__init__()
        # A default statement is mandatory for the transformer to run.
        self._setDefault(statement="select text, label from __THIS__")

    def _transform(self, df):
        # Intended behavior: filter rows instead of running the statement.
        return df.filter(df.id > 23)
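
For reference, the save/load round trip that drops the override looks roughly like this (a minimal sketch; the save path and the input DataFrame df are illustrative assumptions):

# Saving delegates to the JVM SQLTransformer's writer, so only the
# `statement` param is persisted, not the Python _transform override.
filter().save("/tmp/custom_filter")

# Whether reloaded from Python or Scala, the result behaves like a
# plain SQLTransformer that applies the default statement.
reloaded = SQLTransformer.load("/tmp/custom_filter")
reloaded.transform(df).show()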
  • I need to call the `SQLTransformer` in a Scala pipeline. I can save the `SQLTransformer` within Python, then load and run it on the Scala side, but even though I define a `_transform` method in the class, only the default statement is executed on the Scala side. – Bentech Nov 13 '18 at 18:29

1 Answer


Such an information flow is not supported. To create a Transformer that can be used from both the Python and Scala code bases you have to:

  • Implement a Java or Scala Transformer, in your case extending org.apache.spark.ml.feature.SQLTransformer.
  • Add a Python wrapper extending pyspark.ml.wrapper.JavaTransformer, the same way pyspark.ml.feature.SQLTransformer does, and interface with the JVM counterpart from it (a minimal wrapper sketch follows this list).
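
On the Python side the wrapper is thin, because all of the logic lives in the JVM. A minimal sketch, assuming the Scala implementation is compiled onto the classpath under a hypothetical name com.example.ml.FilterTransformer:

from pyspark.ml.wrapper import JavaTransformer

class FilterTransformer(JavaTransformer):
    # Thin wrapper around a hypothetical Scala class
    # com.example.ml.FilterTransformer, which extends
    # org.apache.spark.ml.feature.SQLTransformer.
    def __init__(self):
        super(FilterTransformer, self).__init__()
        # Instantiate the JVM counterpart; JavaTransformer._transform
        # then delegates to it, so the logic is written once, in Scala.
        self._java_obj = self._new_java_obj(
            "com.example.ml.FilterTransformer", self.uid)

Since the behavior is implemented once in Scala, the same class can be used in Pipelines from either language, and persistence round-trips through the shared JVM implementation.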
  • Thanks. That is to say, a custom Transformer written in Python cannot be used in a Scala pipeline. And if I need to write the same code in both Scala and Python, I might as well use what is already written in Scala directly in my Scala pipeline. – Bentech Nov 14 '18 at 15:36