3

I have a data frame with n number of columns and I want to replace empty strings in all these columns with nulls.

I tried using

val ReadDf = rawDF.na.replace("columnA", Map( "" -> null));

and

val ReadDf = rawDF.withColumn("columnA", if($"columnA"=="") lit(null) else $"columnA" );

Both of them didn't work.

Any leads would be highly appreciated. Thanks.

Vinay Kumar
  • 1,664
  • 2
  • 15
  • 19
  • 1
    There is many errors here and it's not just related to replacing an empty string with null but also the jdbc error and as a matter of fact, that second error can't be resolved with the current description and information because it cannot be reproduced. Please review your question as it is subject of being closed ! – eliasah May 22 '17 at 07:40
  • One thing I don't understand is, How is [this][1] relevant to the question I asked. It talks about an array of strings in MYSQL and my question was about replacing empty strings with nulls in the data frame. Can someone enlighten me? – Vinay Kumar May 23 '17 at 11:57
  • Here it is. I haven't voted for this question to be closed as a duplicated but as I said in the comment earlier, you have 2 differents errors and the latest error message is related to the question mentioned by the voters. – eliasah May 23 '17 at 12:01
  • 1
    @eliasah I was not referring to you. I am very sorry that I didn't address the persons. Let me see if I can edit the comment and add them. – Vinay Kumar May 23 '17 at 12:28
  • I know you are not. That's not a problem. – eliasah May 23 '17 at 12:34
  • https://stackoverflow.com/users/6910411/user6910411, https://stackoverflow.com/users/4993128/jwvh, https://stackoverflow.com/users/704848/edchum Can you guys explain how this question is duplicate of https://stackoverflow.com/questions/17371639/how-to-store-arrays-in-mysql? If this is a mistake, please remove it. Its quiet misleading – Vinay Kumar Aug 16 '18 at 08:01
  • 1
    I'm not sure why this was closed as a duplicate of an SQL question, but this has been reopened. – cs95 Sep 12 '18 at 23:50

1 Answers1

6

Your first approach seams to fail due to a bug that prevents replace from being able to replace values with nulls, see here.

Your second approach fails because you're confusing driver-side Scala code for executor-side Dataframe instructions: your if-else expression would be evaluated once on the driver (and not per record); You'd want to replace it with a call to when function; Moreover, to compare a column's value you need to use the === operator, and not Scala's == which just compares the driver-side Column object:

import org.apache.spark.sql.functions._

rawDF.withColumn("columnA", when($"columnA" === "", lit(null)).otherwise($"columnA"))
Tzach Zohar
  • 37,442
  • 3
  • 79
  • 85