I'm completely new to Spark and Scala, trying to work with a data set in Databricks.

I loaded a CSV file as a data frame. Now I want to see the percentage of null values in each column. Later, I want to replace the null values or drop the columns, depending on that percentage.
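
To make the goal concrete, here is a minimal sketch of the per-column percentage I'm after, assuming train is the titanic_test data frame loaded in the snippet further down. One caveat: depending on how the CSV was loaded, missing values may arrive as empty strings rather than real NULLs, in which case the isNull test would need adjusting.

import org.apache.spark.sql.functions.{col, sum, when}

val total = train.count().toDouble

// Count the nulls in every column in a single aggregation pass
val nullCounts = train.select(
  train.columns.map(c => sum(when(col(c).isNull, 1).otherwise(0)).alias(c)): _*
).first()

// Report each column's null percentage
train.columns.zipWithIndex.foreach { case (c, i) =>
  println(f"$c%-12s ${nullCounts.getLong(i) / total * 100}%5.1f%% null")
}

Folding all the counts into one select keeps this to a single job over the data, rather than one filter-and-count per column.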

I think R has some packages capable of analyzing null values (e.g. the MICE package), but for Spark & Scala I can't find anything similar.

I've been trying to filter the data frame by "null" values, but this doesn't seem to work. The code below just returns the cabins that are not null, and swapping == for != doesn't help.

val train = sqlContext.sql("SELECT * FROM titanic_test")
train.show()
val filtered = train.filter("Cabin==null")
filtered.show()
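
For reference, SQL comparisons follow three-valued logic: Cabin == null evaluates to NULL for every row, so the predicate never comes out true no matter which comparison operator is used. Spark's null test is isNull (or IS NULL in an expression string); a minimal sketch, assuming the same train data frame:

// NULL must be tested with isNull / IS NULL, not with = or !=
val nullCabins = train.filter(train("Cabin").isNull)   // Column API
val sameThing  = train.filter("Cabin IS NULL")         // SQL expression string
nullCabins.show()

Both forms should behave identically; the string form goes through the same SQL expression parser as the original filter call.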

Does anyone know of a package that could help, or how to fix the problem above so I can filter manually?

This image shows the data set before it was filtered

This image shows that the filtering is not working

Laura
  • I added a second question because you claimed to be interested in _percentage of null values_. And the second answer explains exactly why you cannot apply equality checks to `NULL` values. If you don't agree with the closure, you can always ask for reopening. – zero323 Mar 13 '17 at 20:31
  • Ok, thanks! Let's leave it closed. I agree it was two questions in one, and not asked in an incredibly clear way. Plus, I found a solution that worked for me. I'll experiment a bit more with "NULL". – Laura Mar 13 '17 at 23:01
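
For later readers, a sketch of the kind of solution the comments point toward: compute each column's null fraction, then drop the column if it exceeds some cutoff. The 0.5 threshold here is a hypothetical placeholder, not anything from the thread.

import org.apache.spark.sql.functions.col

val total = train.count().toDouble
val threshold = 0.5  // hypothetical cutoff: drop columns that are mostly null

// Fold over the columns, dropping any whose null fraction exceeds the cutoff
val cleaned = train.columns.foldLeft(train) { (df, c) =>
  if (train.filter(col(c).isNull).count() / total > threshold) df.drop(c)
  else df
}

Columns kept below the cutoff can then be patched with DataFrameNaFunctions, e.g. cleaned.na.fill("unknown", Seq("Cabin")) for string columns or cleaned.na.fill(0.0) for numeric ones.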

0 Answers