
I'm working with Datasets of different sizes, each with a dynamic number of columns. For my application, I need to know the total number of characters in each row in order to estimate the row size in bytes or kilobytes.

The resulting row size (in KB) will be written to a new column.

private void writeMyData(Dataset<Row> dataSet) {

    // Collect every column of the Dataset into a Column[] so it can be passed to concat_ws
    Column[] columns = Arrays.stream(dataSet.columns()).map(functions::col).toArray(Column[]::new);

    // Concatenate all columns (the separator here is the name of the 4th column),
    // take the character length, and write it to a new "marker" column
    dataSet.withColumn("marker", functions.length(functions.concat_ws(dataSet.columns()[3], columns)))
            .write().partitionBy(hivePartitionColumn)
            .option("header", "true")
            .mode(SaveMode.Append).format(storageFormat).save(pathTowrite);
}

Since none of the methods in org.apache.spark.sql.functions returns a Column[], I had to take dataSet.columns() and collect it into one myself.

But nesting one functions method inside another like this doesn't seem efficient.

I would rather have a single size function that takes a Column[] and returns the total length of the columns, instead of nesting operations.
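
Something like the following rough sketch is what I have in mind - summing functions.length() over every column as a single Column expression (untested; the coalesce is just my guess at handling null values):

    // Untested sketch: sum the character length of every column instead of concatenating first
    Column totalLength = Arrays.stream(dataSet.columns())
            .map(c -> functions.coalesce(functions.length(functions.col(c)), functions.lit(0)))
            .reduce(functions.lit(0), Column::plus);

    Dataset<Row> withMarker = dataSet.withColumn("marker", totalLength);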

  1. Can you help me with a UDF for this kind of operation, or is there an existing function that already does it?
  2. How bad is this kind of solution performance-wise?

A Java solution is preferred.


1 Answer


A nice solution using a Spark DataFrame UDF that I used to get the length in bytes, which works better for my case:

// UDF that returns the byte length of the concatenated row string
// (getBytes() uses the platform default charset)
static UDF1<String, Integer> BytesSize = new UDF1<String, Integer>() {
    public Integer call(final String line) throws Exception {
        return line.getBytes().length;
    }
};

private void saveIt() {

    // Register the UDF with the Spark session before using it
    sparkSession.udf().register("BytesSize", BytesSize, DataTypes.IntegerType);

    // Concatenate all columns, pass the result through the UDF,
    // and write the byte size to a new column
    dfToWrite.withColumn("fullLineBytesSize", functions.callUDF("BytesSize", functions.concat_ws(",", columns)))
            .write().partitionBy(hivePartitionColumn)
            .option("header", "true")
            .mode(SaveMode.Append).format(storageFormat).save(pathTowrite);
}
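
The columns array isn't shown in the snippet above; presumably it is built the same way as in the question, collecting every column of the DataFrame into a Column[]:

    // Assumption: columns holds every column of dfToWrite, as in the question
    Column[] columns = Arrays.stream(dfToWrite.columns())
            .map(functions::col)
            .toArray(Column[]::new);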