
I am working with Spark pipelines and often end up with a bunch of SQLTransformers that each do something different. I can't really tell what any of them does without reading its entire statement.

I would like to attach some simple documentation or a tag to each transformer, so that it is persisted when the transformer is saved and can be retrieved later if need be.

Basically, something like this:

s = SQLTransformer()
s.tag = "basic target generation"
s.save("tmp")

s2 = SQLTransformer.load("tmp")
print(s2.tag)

or

s = SQLTransformer()
s.setParam(tag="basic target generation")
s.save("tmp")

s2 = SQLTransformer.load("tmp")
print(s2.getParam("tag"))

I can see that I can't do either right now, because the param objects are locked down: I can't seem to modify any existing param other than statement, and I can't add new ones. Is there anything I can do to get functionality like this?

I am using Spark 2.1.1 with Python.

1 Answer


Not without implementing your own Scala Transformer that extends SQLTransformer and then writing a Python interface for it (or writing a standalone Python Transformer; see "How to Roll a Custom Estimator in PySpark mllib").

However, if you

would like to add maybe some simple documentation

you can just add comments to the statement:

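from pyspark.ml.feature import SQLTransformer

# The SQL comment is part of the statement string, so it is saved and loaded with it
s = SQLTransformer(statement="""
    -- This is a transformer that selects everything
    SELECT * FROM __THIS__""")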

print(s.getStatement())

##    -- This is a transformer that selects everything
##    SELECT * FROM __THIS__
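
If the comment is meant to act as a retrievable tag, it can be pulled back out of the statement text. The following is only a minimal sketch: `extract_tag` is a hypothetical helper, not part of PySpark, and it assumes the tag is written as leading `--` comment lines at the top of the statement.

```python
def extract_tag(statement):
    """Return the leading `--` comment lines of a SQL statement as one string.

    Hypothetical helper (not a PySpark API). Assumes the tag is written as
    `--` comments placed before the first line of actual SQL.
    """
    tag_lines = []
    for line in statement.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines before or between comment lines
        if stripped.startswith("--"):
            tag_lines.append(stripped[2:].strip())
        else:
            break  # the first non-comment line ends the tag
    return " ".join(tag_lines)


statement = """
    -- basic target generation
    SELECT * FROM __THIS__"""

print(extract_tag(statement))
# basic target generation
```

With a transformer loaded from disk, the same helper could be applied to the result of `getStatement()`, e.g. `extract_tag(s2.getStatement())`.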
Alper t. Turker
  • Ahhh, I feared this was the case... So no hacks other than putting a comment directly in the statement, huh? – Subramaniam Ramasubramanian Aug 06 '18 at 09:27
  • By the way, writing a standalone Python transformer that can be persisted isn't possible unless you are running Spark 2.3+, right? That's the key reason I am going through this entire ordeal to begin with! From what I can see, the only way to write transformers that can be saved is to write them in Scala, expose them in Python, and then redirect the save methods to the Scala API. – Subramaniam Ramasubramanian Aug 06 '18 at 09:29