I have a column of a Spark Dataset (in Java) and I want all the values of this column to become the column names of new columns (the new columns can be filled with a constant value).
For example I have:
+------------+
| Column |
+------------+
| a |
| b |
| c |
+------------+
And I want:
+------+---+---+---+
|Column| a | b | c |
+------+---+---+---+
|  a   | 0 | 0 | 0 |
|  b   | 0 | 0 | 0 |
|  c   | 0 | 0 | 0 |
+------+---+---+---+
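For context, this is the kind of table I think groupBy plus pivot should be able to produce. A minimal, self-contained sketch of what I mean (the class name and app name are just placeholders I made up; min(lit(0)) gives 0 in matching cells and null elsewhere, and na().fill(0) turns the nulls into 0):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.*;

public class PivotSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("pivot-sketch").master("local").getOrCreate();

        // sample data matching the example above
        Dataset<Row> ds = spark.createDataset(
                java.util.Arrays.asList("a", "b", "c"),
                Encoders.STRING()).toDF("Column");

        // group by the column, pivot on its own values, aggregate a constant,
        // then replace the nulls in non-matching cells with 0
        Dataset<Row> result = ds.groupBy("Column")
                .pivot("Column")
                .agg(min(lit(0)))
                .na().fill(0);

        result.show();
        spark.stop();
    }
}
```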
What I tried is:
public class Test {
    static SparkSession spark = SparkSession.builder()
            .appName("Java")
            .config("spark.master", "local")
            .getOrCreate();
    static Dataset<Row> dataset = spark.emptyDataFrame();

    public Dataset<Row> test(Dataset<Row> ds, SparkSession spark) {
        SQLContext sqlContext = new SQLContext(spark);
        sqlContext.udf().register("addSubstrings", addSubstrings,
                DataTypes.createArrayType(DataTypes.StringType));
        ds = ds.withColumn("substrings", functions.callUDF("addSubstrings", ds.col("Column")));
        return ds;
    }

    private static UDF1<String, String[]> addSubstrings = new UDF1<String, String[]>() {
        public String[] call(String str) throws Exception {
            // side effect: add a column named after the current value
            dataset = dataset.withColumn(str, functions.lit(0));
            String[] a = {"placeholder"};
            return a;
        }
    };
}
My problem is that sometimes I get the right result and sometimes I don't (the columns are not added), and I do not really understand why. I was searching for a way to pass the dataset to the UDF, but I don't know how.
At the moment I am solving it by calling collectAsList() on the column, then iterating over the ArrayList and adding the new columns one by one. But that is really inefficient, because I have too much data.
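A slightly lighter variant of that workaround, sketched below on the assumption that the number of *distinct* values is small even when the dataset is large (the class name is a placeholder): collect only the distinct values to the driver instead of the whole column, then loop over them.

```java
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;

public class DistinctColumnsSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("distinct-columns").master("local").getOrCreate();

        // sample data matching the example above
        Dataset<Row> ds = spark.createDataset(
                java.util.Arrays.asList("a", "b", "c"),
                Encoders.STRING()).toDF("Column");

        // only the distinct values travel to the driver, not the whole column
        List<Row> values = ds.select("Column").distinct().collectAsList();
        for (Row r : values) {
            ds = ds.withColumn(r.getString(0), functions.lit(0));
        }

        ds.show();
        spark.stop();
    }
}
```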