
First, note that due to three-valued logic, this isn't simply the negation of any valid implementation of a null-or-empty check.

I want to write a function `isNotNullish`, which is as close as possible to `isNotNull` but also filters out empty strings. I'm running into some oddities involving how columns/column types work, as well as three-valued logic. So far the best I have is:

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.functions.lit

    def isNotNullish(questionable: Column): Column = {
      // oddish is true, false, or null (null when questionable is null)
      val oddish = questionable === lit("")
      questionable.isNotNull && (!oddish || oddish.isNull)
    }

If that looks strange, it's because it is. As far as I can tell, `questionable === lit("")` will return null for a given row if that row holds either null or (some!) non-string types. In three-valued logic, `true && null = null`, which would cause `questionable.isNotNull && (questionable =!= lit(""))` to return null in some cases where I'd want it to return true. The `questionable.isNotNull && (!oddish || oddish.isNull)` version should never yield null, only true or false.
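
To make the three-valued behaviour concrete, here is a minimal, self-contained sketch (my illustration, not from the original post, assuming a SparkSession named `spark` is in scope):

    import org.apache.spark.sql.functions.lit

    // Each expression evaluates per-row under SQL three-valued logic:
    spark.range(1).select(
      (lit(null).cast("string") === lit("")).as("null_eq_empty"),     // null
      (lit(true) && lit(null).cast("boolean")).as("true_and_null"),   // null
      (lit(null).cast("string") === lit("")).isNull.as("eq_is_null")  // true
    ).show()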

This almost-but-not-quite works: for some reason I don't understand, the `===` comparison is happy to return null for numeric types, but fails outright for complex types. (Even if what I'm trying to do is impossible or inadvisable, I'd love an explanation for that.)
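
(My hedged reading of the analyzer behaviour, worth verifying against your Spark version: implicit type coercion covers string-vs-numeric comparisons, and casting `""` to a numeric type yields null, so the comparison evaluates to null per row; there is no coercion rule between, say, ArrayType and StringType, so that comparison fails during analysis instead. A small illustration:)

    import org.apache.spark.sql.functions.{array, col, lit}

    val df = spark.range(1).select(lit(1).as("n"), array(lit(1)).as("arr"))

    // Numeric vs. string: "" is cast for the comparison, yielding null,
    // so === evaluates to null for every row.
    df.select((col("n") === lit("")).as("num_vs_empty")).show()

    // Array vs. string: no coercion rule, so this fails at analysis time
    // with an AnalysisException (uncomment to see):
    // df.select(col("arr") === lit("")).show()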

It might be more responsible to check the schema and simply apply a different test to string columns than to others, but as far as I can tell that requires passing the DataFrame as a parameter. I'm trying to avoid that, especially in long sequences of transformations where said parameter might be anonymous.
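
For reference, a minimal sketch of that schema-aware alternative (the name `isNotNullishByType` is mine, not from the post) might look like:

    import org.apache.spark.sql.{Column, DataFrame}
    import org.apache.spark.sql.functions.{col, lit}
    import org.apache.spark.sql.types.StringType

    def isNotNullishByType(df: DataFrame, name: String): Column =
      df.schema(name).dataType match {
        // Only string columns get the empty-string test; =!= is safe here
        // because both sides are strings.
        case StringType => col(name).isNotNull && col(name) =!= lit("")
        case _          => col(name).isNotNull
      }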

(Before anyone asks: I know that dealing with null and types this sloppily would be horrible in general Scala, but I think it's different in the context of Spark/SQL and huge DataFrames of varying schema. The specific use case is automatic exploration of sketchy data, so being able to answer questions like "I don't know what the columns are, but tell me how often they hold actual values" is useful.)

Edward Peters

1 Answer


Does it work for you to use `<=>`?

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.functions.lit
    import spark.implicits._

    // <=> is null-safe equality, so the negation is always true or false.
    def notNullAndEmpty(c: Column): Column =
      c.isNotNull and !(c <=> lit(""))

    val employees = spark.createDataFrame(Seq(
      ("E1", "100.0"), ("E2", "200.0"), ("E3", null), ("E4", "")
    )).toDF("employee", "salary")

    employees.show()
    employees.filter(notNullAndEmpty($"salary")).show()

Data -

+--------+------+
|employee|salary|
+--------+------+
|      E1| 100.0|
|      E2| 200.0|
|      E3|  null|
|      E4|      |
+--------+------+

Result

+--------+------+
|employee|salary|
+--------+------+
|      E1| 100.0|
|      E2| 200.0|
+--------+------+
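
For what it's worth, `<=>` is Spark's null-safe equality operator (`Column.eqNullSafe` in the Scala API): it returns false rather than null when one side is null, so negating it can never produce null. Reusing the `employees` DataFrame above, you can inspect the intermediate values with something like:

    // salary <=> "" is false (not null) for null salaries, so the combined
    // predicate is always plain true/false.
    employees.select(
      $"salary",
      ($"salary" <=> lit("")).as("eq_empty"),
      notNullAndEmpty($"salary").as("keep")
    ).show()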
Salim
  • Thanks! I didn't know about the `<=>` operator, but that seems to solve it perfectly. – Edward Peters Feb 13 '20 at 18:38
  • On further checking, I think that has the same problem mine does - type errors if you compare to structured data. (The `<=>` is more elegant, though) – Edward Peters Feb 13 '20 at 20:46
  • By structured data, do you mean a Long, an Array, a Struct, etc.? – Salim Feb 13 '20 at 21:34
  • Arrays, specifically. (I suspect structs would cause problems too, but I flatten them out at an earlier point in my code so I'm not sure.) Are Longs structs? Did you mean Map? – Edward Peters Feb 13 '20 at 21:44
  • @Edward I see your point. The `<=>` fails when the data types mismatch. We could get the data type of the column, pass that into the function, and compare only where the data type matches. Another way is to use a Dataset and Scala code. – Salim Feb 13 '20 at 22:21
  • @EdwardPeters try something like this .filter((col("colum_x").isNotNull()).and(col("colum_x").notEqual("null"))) – Shasu Aug 21 '23 at 12:58