
Task: Let df be a Spark data frame. We want to replace a value n in df with NA.

In R I would simply write

df[df==n] <- NA

Problems / questions (as I am new to Spark, any comment is welcome):

  • What is the equivalent in SparkR of NA? I found functions like isNull and isNaN and I am not sure whether and how they differ (a short sketch follows below).
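
As far as I understand, isNull() tests for SQL NULL (which is what R's NA becomes in a SparkDataFrame), while isNaN() tests for the floating-point value NaN. A minimal sketch of the difference (col1 is just a placeholder column name):

library(SparkR)

# rows where col1 is SQL NULL, i.e. the Spark counterpart of R's NA
null_rows <- filter(df, isNull(df$col1))

# rows where col1 holds the floating-point value NaN (a different concept)
nan_rows <- filter(df, isNaN(df$col1))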

I was able to do it on one column col1 using ifelse, i.e.

df[[col1]] <- ifelse(df[[col1]] == n, NA, df[[col1]])

but I was not able to "parallelize" it across all columns.

I tried:

df <- spark.lapply(colnames(df), function(x) { ifelse(df[[x]] == n, NA, df[[x]]) })

but I got the message

Job aborted due to stage failure

which I do not understand.


1 Answer


Some solutions that may help troubleshoot that error:

  • Job aborted due to stage failure: Task from application
  • how-to-handle-null-entries-in-sparkr
  • Add a column full of NAs in Sparkr
  • SparkR API
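
One way to extend the single-column ifelse from the question to all columns is to loop over colnames(df) on the driver, so each iteration just adds a Column expression; spark.lapply ships an R function to the worker processes, where the SparkDataFrame df cannot be referenced, which is one common way to end up with "Job aborted due to stage failure". A minimal, untested sketch (n is a placeholder for the value to replace, and it assumes every column can be compared to n):

library(SparkR)

n <- 999  # placeholder: the value that should become NA

# loop on the driver: each assignment only extends the query plan,
# nothing is computed until an action (e.g. collect) is called
for (x in colnames(df)) {
  df[[x]] <- ifelse(df[[x]] == n, NA, df[[x]])
}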

  • Thank you for your answer, but apart from the first link none of them deals with the problem / task, i.e. 1) how to apply a user-defined function in general? spark.lapply is, for example, mentioned in your link to the SparkR documentation, so why does my "code" not work? Where is the gap in my understanding? 2) As somebody who is familiar with R, I thought there might exist an easy solution for such a specific problem in SparkR. – Christian Jan 05 '19 at 09:20
  • I don't see 1) being asked anywhere in the question. Read the SparkR API; you may find some clues there. – Marc0 Jan 06 '19 at 21:17