I have a Spark Scala library and I am building a Python wrapper on top of it. One class of my library provides the following method:
package com.example

import org.apache.spark.sql.DataFrame

class F {
  def transform(df: DataFrame): DataFrame = ???  // implementation elided
}
and I am using py4j in the following way to create a wrapper for F:

from pyspark import SparkContext

def F():
    return SparkContext.getOrCreate()._jvm.com.example.F()
which allows me to call the method transform.
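For example (a sketch, assuming an active SparkContext; without the converter shown below, this call fails because py4j cannot marshal a PySpark DataFrame on its own):

f = F()
result = f.transform(df)  # df: a pyspark.sql.DataFrame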
The problem is that the Python DataFrame object is obviously different from the Java DataFrame object, so I need a way to convert a Python DataFrame into a Java one. For that I use the following code from the py4j docs:
from py4j import protocol
from pyspark import SparkContext

class DataframeConverter(object):
    def can_convert(self, object):
        from pyspark.sql.dataframe import DataFrame
        return isinstance(object, DataFrame)

    def convert(self, object, gateway_client):
        from pyspark.ml.common import _py2java
        return _py2java(SparkContext.getOrCreate(), object)

protocol.register_input_converter(DataframeConverter())
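With the converter registered, the trip into the JVM works. A sketch (assuming an active SparkSession and com.example.F on the driver classpath):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

f = F()
result = f.transform(df)  # df is converted on the way in, but the result
                          # is a bare py4j JavaObject, not a pyspark DataFrame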
My problem is that I now want to do the inverse: get a Java DataFrame back from transform and keep using it in Python. I tried to use protocol.register_output_converter, but I couldn't find any useful example apart from code dealing with Java collections.
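As far as I can tell, those collection examples boil down to registrations keyed on py4j's wire-level type codes, roughly:

from py4j.java_collections import JavaMap
from py4j.protocol import MAP_TYPE, register_output_converter

# output converters are selected by protocol type code (MAP_TYPE, LIST_TYPE, ...),
# so an arbitrary Java object like a DataFrame never triggers one: it just
# comes back as a generic JavaObject
register_output_converter(
    MAP_TYPE,
    lambda target_id, gateway_client: JavaMap(target_id, gateway_client))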
How can I do that? An obvious solution would be to create a Python class F that defines all the methods present in the Java F, forwards every Python call to the JVM, gets the result back, and converts it accordingly, as sketched below. This approach works, but it means I have to redefine every method of F, which produces code duplication and a lot more maintenance.
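For concreteness, a minimal sketch of that workaround (note that _java2py, like the _py2java used above, is a private PySpark helper from pyspark.ml.common, so this relies on internals):

from pyspark import SparkContext
from pyspark.ml.common import _java2py

class F(object):
    def __init__(self):
        self._sc = SparkContext.getOrCreate()
        self._jf = self._sc._jvm.com.example.F()

    def transform(self, df):
        # df is handled on the way in by the registered input converter;
        # _java2py rewraps the returned Java DataFrame as a pyspark one
        return _java2py(self._sc, self._jf.transform(df))

    # ...and the same boilerplate repeated for every other method of F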