I am working with DataFrames which elements have got a schema similar to:
root
|-- NPAData: struct (nullable = true)
| |-- NPADetails: struct (nullable = true)
| | |-- location: string (nullable = true)
| | |-- manager: string (nullable = true)
| |-- service: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- serviceName: string (nullable = true)
| | | |-- serviceCode: string (nullable = true)
|-- NPAHeader: struct (nullable = true)
| | |-- npaNumber: string (nullable = true)
| | |-- date: string (nullable = true)
In my DataFrame I want to group all elements which has the same NPAHeader.code
, so to do that I am using the following line:
val groupedNpa = orderedNpa.groupBy($"NPAHeader.code" ).agg(collect_list(struct($"NPAData",$"NPAHeader")).as("npa"))
After this I have a dataframe with the following schema:
StructType(StructField(npaNumber,StringType,true), StructField(npa,ArrayType(StructType(StructField(NPAData...)))))
An example of each Row would be something similar to:
[1234,WrappedArray([npaNew,npaOlder,...npaOldest])]
Now what I want is to generate another DataFrame with picks up just one of the element in the WrappedArray, so I want an output similar to:
[1234,npaNew]
Note: The chosen element from the WrappedArray is the one that matches a complext logic after iterating over the whole WrappedArray. But to simplify the question, I will pick up always the last element of the WrappedArray (after iterating all over it).
To do so, I want to define a recurside udf
import org.apache.spark.sql.functions.udf
def returnRow(elementList : Row)(index:Int): Row = {
val dif = elementList.size - index
val row :Row = dif match{
case 0 => elementList.getAs[Row](index)
case _ => returnRow(elementList)(index + 1)
}
row
}
val returnRow_udf = udf(returnRow _)
groupedNpa.map{row => (row.getAs[String]("npaNumber"),returnRow_udf(groupedNpa("npa")(0)))}
But I am getting the following error in the map:
Exception in thread "main" java.lang.UnsupportedOperationException: Schema for type Int => Unit is not supported
What am I doing wrong?
As an aside, I am not sure if I am passing correctly the npa
column with groupedNpa("npa")
. I am accesing the WrappedArray as a Row, because I don't know how to iterate over Array[Row]
(the get(index)
method is not present in Array[Row])