I'm working with datasets of different sizes, each with a dynamic number of columns. For my application, I need to know the total character length of each row in order to estimate the row size in bytes or kilobytes.
The resulting row size (in KB) should be written to a new column.
private void writeMyData(Dataset<Row> dataSet) {
    Column[] columns = Arrays.stream(dataSet.columns()).map(functions::col).toArray(Column[]::new);
    // "marker" = character length of the whole row; empty separator so
    // concat_ws does not add extra characters to the count
    dataSet.withColumn("marker", functions.length(functions.concat_ws("", columns)))
           .write().partitionBy(hivePartitionColumn)
           .option("header", "true")
           .mode(SaveMode.Append).format(storageFormat).save(pathTowrite);
}
Since none of the methods in org.apache.spark.sql.functions returns a Column[], I had to call dataSet.columns() and collect the names into a Column[] myself.
But nesting calls like functions.length(functions.concat_ws(...)) every time doesn't seem efficient.
I would rather have a single size function that takes a Column[] and returns the total length of all the columns, instead of nesting operations.
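Roughly what I have in mind is the sketch below (untested, just my guess at how it could look; the UDF name "rowLength" and the idea of packing the columns into a struct are mine). I would then divide the result by 1024 to get KB:

import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

// hypothetical UDF that sums the string length of every field in a row;
// nulls count as 0
dataSet.sparkSession().udf().register("rowLength",
        (UDF1<Row, Integer>) row -> {
            int total = 0;
            for (int i = 0; i < row.size(); i++) {
                Object value = row.get(i);
                if (value != null) {
                    total += value.toString().length();
                }
            }
            return total;
        },
        DataTypes.IntegerType);

// columns is the same Column[] built in writeMyData above;
// struct() packs all columns so the UDF sees the whole row at once
Dataset<Row> withMarker = dataSet.withColumn("marker",
        functions.callUDF("rowLength", functions.struct(columns)));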
- Can you help me write a UDF for this kind of operation, or is there an existing function that already does it?
- How bad is my current approach in terms of performance?
A Java solution is preferred.