2

In SparkR shell 1.5.0, Created a sample data set:

df_test <- createDataFrame(sqlContext, data.frame(mon = c(1,2,3,4,5), year = c(2011,2012,2013,2014,2015)))
df_test1 <- createDataFrame(sqlContext, data.frame(mon1 = c(1,2,3,4,5,6,7,8)))
df_test2 <- join(df_test1, df_test, joinExpr = df_test1$mon1 == df_test$mon, joinType = "left_outer")

data set : df_test2

+----+----+------+
|mon1| mon|  year|
+----+----+------+
| 7.0|null|  null|
| 1.0| 1.0|2011.0|
| 6.0|null|  null|
| 3.0| 3.0|2013.0|
| 5.0| 5.0|2015.0|
| 8.0|null|  null|
| 4.0| 4.0|2014.0|
| 2.0| 2.0|2012.0|
+----+----+------+

Question: If there is null how can I replace it with 0 in column df_test2$year or else use a default value?

The output should look like this,

+----+----+------+
|mon1| mon|  year|
+----+----+------+
| 7.0|null|  0   |
| 1.0| 1.0|2011.0|
| 6.0|null|  0   |
| 3.0| 3.0|2013.0|
| 5.0| 5.0|2015.0|
| 8.0|null|  0   |
| 4.0| 4.0|2014.0|
| 2.0| 2.0|2012.0|
+----+----+------+

I have used otherwise/when, but doesn't work

df_test2$year <- otherwise(when(isNull(df_test2$year), 0 ), df_test2$year)

It throw ed error,

Error in rep(yes, length.out = length(ans)) :
  attempt to replicate an object of type 'environment'
zero323
  • 322,348
  • 103
  • 959
  • 935
Arun Gunalan
  • 814
  • 7
  • 26
  • what exactly is the type of your "null" value? (I don't know SparkR, sorry) – Alexandre Halm Jan 07 '16 at 11:51
  • Try this `df_test2$year[is.null(df_test2$year)] <- 0` Subset your data and "overwrite" the nulls. – Roman Jan 07 '16 at 11:52
  • Are you looking for sparkR commands? Problem is easily solved in R, with @Jimbou 's solution or `setDT(df_test2)[is.null(year),year:=0]`, but will these work for you in spark env? – mtoto Jan 07 '16 at 12:09
  • Am looking for SparkR commands.. – Arun Gunalan Jan 08 '16 at 05:01
  • Possible duplicate of [How to remove empty entries in SparkR](http://stackoverflow.com/questions/31598611/how-to-remove-empty-entries-in-sparkr) – zero323 Jan 09 '16 at 21:10

2 Answers2

3

I have used raw SQL case when expression to get the answer,

df_test3 <- sql(sqlContext, "select mon1, mon, case when year is null then 0 else year end year FROM temp")

showDF(df_test3)
+----+----+------+
|mon1| mon|  year|
+----+----+------+
| 7.0|null|   0.0|
| 1.0| 1.0|2011.0|
| 6.0|null|   0.0|
| 3.0| 3.0|2013.0|
| 5.0| 5.0|2015.0|
| 8.0|null|   0.0|
| 4.0| 4.0|2014.0|
| 2.0| 2.0|2012.0|
+----+----+------+

Even though it gives the answer, i am looking for pure sparkR code.

Arun Gunalan
  • 814
  • 7
  • 26
3
df_test2$year <- ifelse(isNull(df_test2$year), 0, df_test2$year)

Pure SparkR code that resolves the problem.

Billal Begueradj
  • 20,717
  • 43
  • 112
  • 130
Stanley
  • 31
  • 2