I have a Spark Scala library and I am building a Python wrapper on top of it. One class of my library provides the following method:

package com.example

import org.apache.spark.sql.DataFrame

class F {
  def transform(df: DataFrame): DataFrame = ??? // implementation elided
}

and I am using Py4J in the following way to create a wrapper for F:

from pyspark import SparkContext

def F():
    return SparkContext.getOrCreate()._jvm.com.example.F()

which allows me to call the transform method.

The problem is that the Python DataFrame object is obviously different from the Java DataFrame object, so I need a way to convert a Python df to a Java one. For that I use the following code, adapted from the Py4J docs:

from py4j import protocol
from pyspark import SparkContext


class DataframeConverter(object):
    def can_convert(self, object):
        from pyspark.sql.dataframe import DataFrame
        return isinstance(object, DataFrame)

    def convert(self, object, gateway_client):
        from pyspark.ml.common import _py2java
        return _py2java(SparkContext.getOrCreate(), object)


protocol.register_input_converter(DataframeConverter())
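
With the converter registered, I can pass a pyspark DataFrame straight to the Java method and Py4J converts the argument on the way in. A minimal sketch of a call (the session setup and the columns are just placeholders for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

f = F()
java_df = f.transform(df)  # df goes through DataframeConverter on the way in
# java_df comes back as a raw py4j JavaObject, not a pyspark DataFrame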

My problem is that now I want to do the inverse: get a Java DataFrame back from transform and continue using it in Python. I tried protocol.register_output_converter but I could not find any useful example, apart from code dealing with Java collections.
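
For reference, the manual conversion I would like to avoid writing at every call site looks roughly like this, using _java2py (the counterpart of _py2java above) to wrap the returned JavaObject back into a pyspark DataFrame:

from pyspark import SparkContext
from pyspark.ml.common import _java2py

java_df = F().transform(df)                            # py4j JavaObject
py_df = _java2py(SparkContext.getOrCreate(), java_df)  # back to a pyspark DataFrame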

How can I do that? An obvious solution would be to create a Python class F which defines all the methods present in the Java F, forwards every Python call to the JVM, gets back the result and converts it accordingly. This approach works, but it means I have to redefine every method of F, which leads to code duplication and a lot more maintenance.
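
To make the trade-off concrete, the hand-written wrapper I am describing would look something like this, with one forwarding method per Java method (which is exactly the duplication I want to avoid):

from pyspark import SparkContext
from pyspark.ml.common import _py2java, _java2py


class F(object):
    def __init__(self):
        self._jf = SparkContext.getOrCreate()._jvm.com.example.F()

    def transform(self, df):
        sc = SparkContext.getOrCreate()
        jdf = self._jf.transform(_py2java(sc, df))  # convert the argument by hand
        return _java2py(sc, jdf)                    # convert the result by hand

    # ...and the same boilerplate repeated for every other method of com.example.F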

alexlipa
  • Possible duplicate of [How to use a Scala class inside Pyspark](https://stackoverflow.com/questions/36023860/how-to-use-a-scala-class-inside-pyspark) – user10938362 May 24 '19 at 11:08
  • my question is how to register this behaviour automatically, instead of manually writing the df transformation in input/output – alexlipa May 24 '19 at 11:23

0 Answers