
I have been researching how to run Python code from Java, and I have seen a few options for doing so.

My scenario is a little different: imagine a Spark application written in Java that processes a large dataset (say 3 billion records, around 1 TB in size) in a distributed fashion. For every single record, the Python code will be called once: the Java code passes an Avro record, and the Python code processes it and returns a result.

Given that performance is important and we are dealing with large datasets, I am trying to figure out the best way to approach this problem.
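
For reference, the rough shape of one option I have looked at is Spark's pipe() transformation, which launches an external process once per partition and streams elements over stdin/stdout, so the Python interpreter startup cost is not paid per record. A minimal sketch, where the HDFS paths and script name are placeholders, and which assumes each Avro record can be serialized to a single line (e.g. JSON or base64) since pipe() is line-oriented:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class PipeToPython {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("PipeToPython");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Hypothetical input: one serialized record (JSON/base64) per line.
            JavaRDD<String> records = sc.textFile("hdfs:///data/records");

            // pipe() starts the command once per partition, writes each element
            // to the process's stdin, and turns each line the process prints to
            // stdout into an element of the result RDD. The script would be
            // shipped to executors (e.g. via --files on spark-submit).
            JavaRDD<String> results = records.pipe("python process_record.py");

            results.saveAsTextFile("hdfs:///data/results");
            sc.stop();
        }
    }

The Python side would just loop over sys.stdin, process each line, and print one result line per input.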

dbustosp
  • The main idea that comes to mind for [theoretical] simplicity is registering Python UDF(s) for use in Java. Here are some related SOs that I haven't vetted let alone tested myself: https://stackoverflow.com/questions/36171208/implement-a-java-udf-and-call-it-from-pyspark - https://stackoverflow.com/questions/29143033/how-to-register-python-function-as-udf-in-sparksql-in-java-scala – Garren S Aug 18 '17 at 03:43
  • More resources (creating permanent UDF in hive to use in spark): https://stackoverflow.com/questions/39023638/unable-to-use-an-existing-hive-permanent-udf-from-spark-sql - https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-20033 – Garren S Aug 18 '17 at 03:51
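
If the UDF route from the comments above pans out (i.e. the Python logic ends up registered as a permanent function visible through the metastore), the Java side would reduce to an ordinary Spark SQL call. A sketch under that assumption, where process_record and the records table are hypothetical names and the registration itself happens elsewhere:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class CallRegisteredUdf {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("CallRegisteredUdf")
                    .enableHiveSupport() // required to resolve permanent Hive functions
                    .getOrCreate();

            // "process_record" stands in for a function registered separately
            // (see the links in the comments); "records" is a hypothetical table.
            Dataset<Row> out = spark.sql(
                    "SELECT process_record(payload) AS result FROM records");
            out.write().parquet("hdfs:///data/udf_results");
            spark.stop();
        }
    }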
