0

I have a spark dataFrame that looks like this:

  id      dates value
1 11 2013-11-15    10
2 11 2013-11-16    15
3 22 2013-11-15    20
4 22 2013-11-16    21
5 22 2013-11-17     3

I wish to retain the value from the previous date per id.

The final result should look like this:

  id      dates value prev_value
1 11 2013-11-15    10        NA
2 11 2013-11-16    15        10
3 22 2013-11-15    20        NA
4 22 2013-11-16    21        20
5 22 2013-11-17     3        21

The solution from this question would not work for various reasons.

I would appreciate the help!

Community
  • 1
  • 1
Nikolay Nenov
  • 547
  • 1
  • 8
  • 27
  • Not a SparkR functions, but window functions would be what you're looking for (eg http://stackoverflow.com/questions/34885981/sparkr-window-function) – David Oct 04 '16 at 19:09
  • @David , I was not able to find something that would work for my case, but if you can suggest something - I'll be happy to accept your answer. Cheers – Nikolay Nenov Oct 05 '16 at 17:45

1 Answers1

1

So after playing with it for a while, here's the workaround that I found:

First of all, here's the example DF

id<-c(11,11,22,22,22)
dates<-as.Date(c('2013-11-15','2013-11-16','2013-11-15','2013-11-16','2013-11-17'), "%Y-%m-%d")
value <- c(10,15,20,21,3)

example<-as.DataFrame(data.frame(id=id,dates=dates, value))

I copy the example DF and add 1 day to the original date, then rename the column

example_p <- example
example_p$dates <- date_add(example_p$dates, 1)
colnames(example_p) <- c("id", "dates", "prev_value")

Finally, I merge the new DF to the original one

result <- select(merge(example, example_p, by = intersect(names(example),names(example_p))
              , all.x = T), c("id_x", "dates_x", "value", "prev_value"))

showDF(result)

+----+----------+-----+----------+
|id_x|   dates_x|value|prev_value|
+----+----------+-----+----------+
|22.0|2013-11-15| 20.0|      null|
|11.0|2013-11-15| 10.0|      null|
|11.0|2013-11-16| 15.0|      10.0|
|22.0|2013-11-16| 21.0|      20.0|
|22.0|2013-11-17|  3.0|      21.0|
+----+----------+-----+----------+

Obviously, this is somehow clumsy and I will be happy to give the points to anyone who can suggest a solution that would work faster than this.

Nikolay Nenov
  • 547
  • 1
  • 8
  • 27