
What is the preferred way to implement a class based on the Function1/MapFunction interfaces in Spark 2.3, where the class will mutate the individual row's schema? Ultimately, every row's schema might become different depending on the result of different look-ups.

Something like:

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Row;

public class XyzProcessor implements MapFunction<Row, Row> {
...
    @Override
    public Row call(Row row) throws Exception {
        // The `row` schema will be changed here...
        return row;
    }
...
}

The Dataset's .map method will be called like this:

ExpressionEncoder<Row> rowEncoder = RowEncoder.apply(foo.schema());
dataset.map(new XyzProcessor(), rowEncoder);

The "problem" is that the XyzProcessor will alter the schema by adding columns to the row, thus leaving the rowEncoder with a schema that no longer matches the rows being produced. What is the preferred way to deal with this?

Is this the right way to accomplish Dataset modifications?

Kodo

1 Answer


There is a conceptual error in your design:

Ultimately every row's schema might become different depending on the result of different look-ups.

The schema in Spark SQL has to be fixed. In the worst-case scenario it can be a serialized BLOB, but it has to be consistent across all rows.
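For the serialized-BLOB option, a minimal sketch (the column names and the idea of JSON/Avro bytes are assumptions, not something from the question) is a fixed schema whose variable part lives in a single binary column:

import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Fixed schema: stable columns plus one binary column that carries the
// per-row variable payload (e.g. JSON or Avro bytes produced by the look-ups).
StructType blobSchema = new StructType()
    .add("id", DataTypes.LongType, false)
    .add("payload", DataTypes.BinaryType, true);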

You have to go back to the drawing board and redesign your process. If the output is type compatible (there are no conflicts between (path, type) tuples), then making the remaining fields nullable should solve your problem.
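A sketch of that approach, assuming a hypothetical input Dataset<Row> with id and name columns plus two optional look-up columns: the superset schema is fixed up front and every row is padded to it with nulls.

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class SupersetSchemaExample {

    // Union of every column any look-up can add; optional columns are nullable.
    private static final StructType TARGET_SCHEMA = new StructType()
        .add("id", DataTypes.LongType, false)
        .add("name", DataTypes.StringType, true)
        .add("lookupA", DataTypes.StringType, true)   // hypothetical column filled by look-up A
        .add("lookupB", DataTypes.IntegerType, true); // hypothetical column filled by look-up B

    public static Dataset<Row> enrich(Dataset<Row> dataset) {
        ExpressionEncoder<Row> targetEncoder = RowEncoder.apply(TARGET_SCHEMA);
        return dataset.map(
            (MapFunction<Row, Row>) row -> RowFactory.create(
                row.getAs("id"),
                row.getAs("name"),
                null,   // result of look-up A, or null when it does not apply to this row
                null),  // result of look-up B, or null when it does not apply to this row
            targetEncoder);
    }
}

Because TARGET_SCHEMA never changes, the encoder passed to .map stays consistent with every row the mapper produces.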

If not, I would go with RDDs, which support proper type hierarchies.
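If the outputs really cannot share one schema, a sketch of the RDD route might look like this (the Enriched hierarchy, the column names, and the branching condition are all hypothetical):

import java.io.Serializable;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Hypothetical type hierarchy: one subclass per look-up outcome.
abstract class Enriched implements Serializable { long id; }
class EnrichedWithA extends Enriched { String valueA; }
class EnrichedWithB extends Enriched { int valueB; }

class RddRoute {
    // Dropping to the RDD API removes the fixed-schema requirement:
    // every element may be a different subclass of Enriched.
    static JavaRDD<Enriched> enrich(Dataset<Row> dataset) {
        return dataset.toJavaRDD().map(row -> {
            long id = row.getLong(row.fieldIndex("id"));
            if (!row.isNullAt(row.fieldIndex("aKey"))) {   // hypothetical branching condition
                EnrichedWithA a = new EnrichedWithA();
                a.id = id;
                a.valueA = row.getString(row.fieldIndex("aKey"));
                return a;
            }
            EnrichedWithB b = new EnrichedWithB();
            b.id = id;
            b.valueB = row.getInt(row.fieldIndex("bKey"));
            return b;
        });
    }
}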

Alper t. Turker
  • Thank you so much for the answer. I was "afraid" that a fixed schema was a requirement especially if I wanted to write the result to a parquet-file. Better to know this "up front" :) – Kodo Apr 19 '18 at 15:37