I'm trying to write a UDF whose input type is an array of structs. For example, let's say I have the following structure of data. All of this comes from a single column of a single row in a table.
[
  {
    "id": { "value": "23tsdag" },
    "parser": { },
    "probability": 1
  },
  {
    "id": { "value": "ysadoghues" },
    "parser": { },
    "probability": 0.98
  },
  {
    "id": { "value": "ds8galiusgh4" },
    "parser": { },
    "probability": 0.7
  },
  ...
  {
    "id": { "value": "28sh32ds" },
    "parser": { },
    "probability": 0.3
  }
]
For my Java UDF, I want to read this in as a Seq<Row>, since according to Spark SQL UDF with complex input parameter, "... struct types are converted to o.a.s.sql.Row ... data will be exposed as Seq[Row]".
Therefore, this is my Java code:
public class MyUdf implements UDF1<Seq<Row>, String> {
    public String call(Seq<Row> sequence) throws Exception {
        ...
        return "Some String";
    }
}
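To make concrete what I mean by "iterate over the sequence of structs", here is a self-contained sketch of the shape of the call body. Since the real body is elided above, this is illustrative only: SimpleUdf1 is a hand-rolled stand-in for o.a.s.sql.api.java.UDF1, and a String[] stands in for a Row whose field 0 is id.value (so this compiles without Spark on the classpath).

```java
import java.util.Arrays;
import java.util.List;

public class MyUdfSketch {
    // Hypothetical stand-in mirroring the shape of o.a.s.sql.api.java.UDF1
    interface SimpleUdf1<T, R> {
        R call(T input) throws Exception;
    }

    public static void main(String[] args) throws Exception {
        // Illustrative body only: join the id values to show iteration
        // over the sequence of structs (each String[] stands in for a Row)
        SimpleUdf1<List<String[]>, String> myUdf = rows -> {
            StringBuilder sb = new StringBuilder();
            for (String[] row : rows) {
                if (sb.length() > 0) sb.append(",");
                sb.append(row[0]); // field 0: the id.value of each struct
            }
            return sb.toString();
        };

        List<String[]> input = Arrays.asList(
            new String[] {"23tsdag", "1"},
            new String[] {"ysadoghues", "0.98"}
        );
        System.out.println(myUdf.call(input)); // prints "23tsdag,ysadoghues"
    }
}
```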
How can I test this piece of code? Specifically, I've been trying to read JSON from a file, turn it into a Dataset<Row>, turn that into a List<Row>, then turn that into a Seq<Row>, and then pass it as a parameter to my UDF as follows:
@Test
public void testMyUdf() throws Exception {
    sqlCtx.udf().registerJava("my_udf", MyUdf.class, DataTypes.StringType);
    String filePath = "sample_1.json";
    Dataset<Row> ds = spark.read().option("multiline", "true").json(filePath);
    List<Row> list = ds.collectAsList();
    Seq<Row> sequence = JavaConverters.collectionAsScalaIterableConverter(list).asScala().toSeq();
    sqlCtx.sql("select my_udf(" + sequence + ")").show();
    ...
    assertEquals(...);
}
However, when I do this, I keep getting errors such as this:
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input '(' expecting {')', ','}(line 1, pos 52)
== SQL ==
select my_udf(Stream([[ABC/42gadsgy5wsdga==],.....
--------------------^^^
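Looking at the error, I suspect that concatenating the Seq into the SQL string just calls toString() on it, so the parser sees Scala's Stream rendering of the rows rather than valid SQL. The same pattern can be reproduced in plain Java (no Spark needed) with nested lists standing in for the rows:

```java
import java.util.Arrays;
import java.util.List;

public class ConcatCheck {
    public static void main(String[] args) {
        // Nested lists standing in for the collected rows
        List<List<String>> rows = Arrays.asList(
            Arrays.asList("23tsdag"),
            Arrays.asList("ysadoghues")
        );
        // Same pattern as "select my_udf(" + sequence + ")":
        // the collection's toString() is pasted straight into the query text
        String sql = "select my_udf(" + rows + ")";
        System.out.println(sql);
        // prints: select my_udf([[23tsdag], [ysadoghues]])
        // -> brackets like these are not valid SQL, hence the ParseException
    }
}
```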
Am I doing something wrong? I've been stuck on this all day and any pointers/tips/help would be greatly appreciated. Thank you.
The whole point of doing this is so that my UDF can take in a Seq<Row> as described in Spark SQL UDF with complex input parameter. Is this even the right approach? I wanted to be as generic as possible by using Rows instead of defining specific classes, because the input contents may vary widely.