
Regarding the Spark 2.3 documentation:

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.SQLContext.registerJavaFunction

registerJavaFunction(name, javaClassName, returnType=None)

Register a Java user-defined function as a SQL function.

In addition to a name and the function itself, the return type can be optionally specified. When the return type is not specified we would infer it via reflection.

Parameters:

name – name of the user-defined function

javaClassName – fully qualified name of java class

returnType – the return type of the registered Java function. The value can be either a pyspark.sql.types.DataType object or a DDL-formatted type string.


My question:

I want to build a library containing a large number of UDFs for Spark 2.3+, all written in Java and all accessible from PySpark/Python.

From the documentation linked above, it appears that there is a one-to-one mapping between a Java class and a Java UDF (callable from Spark SQL in PySpark). So if I have, say, 10 Java UDFs, then I need to create 10 public Java classes, with 1 UDF per class, to make them callable from PySpark/SQL.

Is this correct?

Can I create 1 public Java class, place a number of different UDFs inside that 1 class, and make all of them callable from PySpark in Spark 2.3?

The post linked below does not provide any Java sample code to help with my question; it appears to be all in Scala. I want it all in Java, please. Do I need to extend a class or implement an interface to do it in Java? Any links to sample Java code that can be called from PySpark SQL would be appreciated.

Spark: How to map Python with Scala or Java User Defined Functions?

Acid Rider

2 Answers


So that if I have say 10 Java UDF functions then I need to create 10 public Java classes with 1 UDF per class to make them callable from PySpark/SQL.

Is this correct?

Yes, that's correct. However you can:

  • I have read the post you linked to before I posted my question. There is no mention of your suggestions there. Please be more specific. Perhaps you can provide a small example in Java? I was very specific in my question that I want it all done in Java, *not* Scala. – Acid Rider Aug 11 '18 at 23:13

The very simple Java/Python/PySpark code sample below may help someone. I got it working on Spark 2.3.1 and Java 1.8 for a Java UDF callable from Python.

Note that this approach seems very cumbersome to me, as you need a separate Java class for each Java UDF. So 50 discrete Java UDFs means 50 separate public Java classes! Ideally, a single public Java class could contain a number of individual Java UDFs, all packaged in a single JAR file. Alas, I still don't know how to do that.

Improvement suggestions welcome! Thank you

// Java 8 code
package com.yourdomain.sparkUDF;

// UDF0 is Spark's interface for a Java UDF that takes no arguments
import org.apache.spark.sql.api.java.UDF0;

// One public class per UDF; Spark instantiates it by its fully qualified name
public final class JavaUDFExample
        implements UDF0<String> {
    @Override
    public String call() throws Exception {
        // Return a random UUID as a string
        return java.util.UUID.randomUUID().toString();
    }
}
// end of Java code
// Build a JAR file from the class above, including all referenced Spark JAR libraries

# PySpark Python code below
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType


spark = SparkSession.builder.appName("Java UDF Example").getOrCreate()

# Load the sample data and expose it to Spark SQL as a temporary view
df = spark.read.json(r"c:\temp\temperatures.json")
df.createOrReplaceTempView("citytemps")

# Register the Java class as the SQL function getGuid, returning a string
spark.udf.registerJavaFunction("getGuid", "com.yourdomain.sparkUDF.JavaUDFExample", StringType())

spark.sql("SELECT getGuid() as guid, * FROM citytemps").show()
# end of PySpark-SQL Python code

Windows command line to run it on a local Spark installation:

spark-submit --jars c:\dir\sparkjavaudf.jar python-udf-example.py
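
With many one-class-per-UDF classes in a single JAR, the Python side can at least register them in one loop rather than one call per line. The following is only a sketch under stated assumptions: `register_java_udfs` and the `JAVA_UDFS` table are hypothetical names, and `StrLenUDF` is a placeholder for a class that would have to exist in the JAR. The return types are given as DDL-formatted strings, which the documentation quoted in the question says `registerJavaFunction` accepts.

```python
# Hypothetical lookup table: SQL function name -> (Java class, return type).
# StrLenUDF is a placeholder; only JavaUDFExample appears in the answer above.
JAVA_UDFS = {
    "getGuid": ("com.yourdomain.sparkUDF.JavaUDFExample", "string"),
    "strLen": ("com.yourdomain.sparkUDF.StrLenUDF", "int"),
}


def register_java_udfs(udf_registration, udfs):
    """Register each Java UDF class under its SQL name; return the names used.

    udf_registration is expected to be spark.udf, or any object exposing
    registerJavaFunction(name, javaClassName, returnType).
    """
    registered = []
    for name, (java_class, return_type) in sorted(udfs.items()):
        udf_registration.registerJavaFunction(name, java_class, return_type)
        registered.append(name)
    return registered


# With a live session this would be called as:
#     register_java_udfs(spark.udf, JAVA_UDFS)
```

This does not reduce the number of Java classes, but it keeps the name-to-class mapping in one place, so adding a new UDF is a one-line change on the Python side.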
Acid Rider