Replace null/undefined/unknown values with most frequent ones using SparkSQL programmatically

Question

I have one dataset in spark application has the following shape:

some_id     class       city
1           A           ROME
1           A           undefined
1           A           ROME
1           null        ROME
2           B           MILAN
2           B           unkown
2           B           MILAN
2           unknown     MILAN
3           C           PALERMO
3           C           PALERMO
3           C           null
3           null        PALERMO

And the resulting dataset should look like the following:

some_id     class       city
1           A           ROME
1           A           ROME
1           A           ROME
1           A           ROME
2           B           MILAN
2           B           MILAN
2           B           MILAN
2           B           MILAN
3           C           PALERMO
3           C           PALERMO
3           C           PALERMO
3           C           PALERMO

I tried basic way of loops but I found it not practical, what is the best way to do that?

here is what i tried to do:

String[] columnsNames = {"class", "city"};

for (String columnName : columnsNames)  {
                Dataset<Row> grouped = mydataset.groupBy(col("some_id"), col(columnName)).agg(functions.count("*").alias("itemCount"));

                grouped = grouped
                        .where(not(col(columnName).equalTo("null")))
                        .groupBy(col("some_id"))
                        .agg(functions.max(col("itemCount")))
                ;

                grouped.show();

                ...etc    
}

[`melt` columns for which you want to find mode](https://stackoverflow.com/q/41670103/6910411), `groupBy` column and value and count, [take value corresponding to each (column, value) group](https://stackoverflow.com/q/41670103/6910411), collect and use to generate `na.fill` calls. — zero323, Jun 25 '18 at 15:33

score 0 · Answer 1 · answered Jun 25 '18 at 16:37

if the id = 1 is always associated with class A and city Rome you have two options: modify you input file or using spark (dataset) you can group by id, select only records that with class andcity != null or unknown and store the result in a new dataset. To achieve the result you can use sql feature and do something like this:

    dataset.createOrReplaceTempView("tempView");
    Dataset<Row> filteredRows = session.sql("select * from tempView where some_id <> null and some_id <> unknown and city <> null and city <> uknown group by some_id")

what if i have one column that contain timestamp which i forgot to mention. In this case, I need to keep all records and replace the unwanted values with the most frequent ones. — Tagera, Jun 26 '18 at 07:15

Replace null/undefined/unknown values with most frequent ones using SparkSQL programmatically

1 Answers1