
I wrote a custom SQLTransformer in PySpark, where setting a default SQL statement is mandatory for the code to execute at all. I can save the custom transformer in Python, then load and run it from Scala and/or Python, but only the default statement is executed, even though the `_transform` method does something else. I get the same result in both languages, so the problem is not related to the `_to_java` method or the `JavaTransformer` class.

from pyspark.ml.feature import SQLTransformer

class filter(SQLTransformer):
    def __init__(self):
        super(filter, self).__init__()
        # A default statement is mandatory for the transformer to run.
        self._setDefault(statement="select text, label from __THIS__")

    def _transform(self, df):
        # Intended behavior: filter rows instead of running the statement.
        return df.filter(df.id > 23)
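
For reference, the save/load round trip that drops the override looks roughly like this (a minimal sketch; the save path and the input DataFrame df are illustrative assumptions):

# Saving delegates to the JVM SQLTransformer's writer, so only the
# `statement` param is persisted, not the Python _transform override.
filter().save("/tmp/custom_filter")

# Whether reloaded from Python or Scala, the result behaves like a
# plain SQLTransformer that applies the default statement.
reloaded = SQLTransformer.load("/tmp/custom_filter")
reloaded.transform(df).show()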
  • I need to call the `SQLTransformer` in a Scala pipeline. I can save the `SQLTransformer` within Python, then load and run it on the Scala side, but even though I define a `_transform` method in the class, only the default statement is executed on the Scala side. – Bentech Nov 13 '18 at 18:29

1 Answer


Such an information flow is not supported. To create a Transformer that can be used from both the Python and Scala code bases you have to:

  • Implement a Java or Scala Transformer, in your case extending org.apache.spark.ml.feature.SQLTransformer.
  • Add a Python wrapper extending pyspark.ml.wrapper.JavaTransformer, the same way pyspark.ml.feature.SQLTransformer does, and interface with the JVM counterpart from it (a minimal wrapper sketch follows this list).
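
On the Python side the wrapper is thin, because all of the logic lives in the JVM. A minimal sketch, assuming the Scala implementation is compiled onto the classpath under a hypothetical name com.example.ml.FilterTransformer:

from pyspark.ml.wrapper import JavaTransformer

class FilterTransformer(JavaTransformer):
    # Thin wrapper around a hypothetical Scala class
    # com.example.ml.FilterTransformer, which extends
    # org.apache.spark.ml.feature.SQLTransformer.
    def __init__(self):
        super(FilterTransformer, self).__init__()
        # Instantiate the JVM counterpart; JavaTransformer._transform
        # then delegates to it, so the logic is written once, in Scala.
        self._java_obj = self._new_java_obj(
            "com.example.ml.FilterTransformer", self.uid)

Since the behavior is implemented once in Scala, the same class can be used in Pipelines from either language, and persistence round-trips through the shared JVM implementation.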
  • Thanks. That is to say, a custom Transformer written in Python cannot be used in a Scala pipeline. And if I need to write the same code in both Scala and Python, I might as well use what is already written in Scala directly in my Scala pipeline. – Bentech Nov 14 '18 at 15:36