I am new to R and am trying to rewrite some R code in SparkR. One of the operations on a data.table named costTbl (which has 5 other columns) is:

costTbl[,cost:=na.locf(cost,na.rm=FALSE),by=product_id]
costTbl[,cost:=na.locf(cost,na.rm=FALSE, fromLast=TRUE),by=product_id]

I cannot find an equivalent operation in SparkR. I thought gapply could be used by grouping the DataFrame on product_id and performing this operation within each group, but I have not been able to make the code work.

Is gapply the right approach? Is there another way to achieve this?

raizsh

2 Answers

Start with some dummy data.

library(SparkR)
library(magrittr)

df <- createDataFrame(data.frame(
  time = c(1, 2, 3, 1, 2, 3),
  product_id = c(1, 1, 1, 2, 2, 2),
  cost = c(1, 2, NA, NA, 2, NA)
))

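Use last with na.rm = TRUE over a window that is partitioned by product_id, ordered by time, and spans from the first row of the partition up to the current row. This reproduces the forward fill (the first na.locf call); a leading NA within a product stays NA because there is no earlier value to carry.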

df %>%
  mutate(
    locf_cost = over(
      last("cost", na.rm = TRUE),
      windowPartitionBy("product_id") %>% orderBy("time") %>% rowsBetween(Window.unboundedPreceding, 0)
    )
  ) %>%
  collect()
#>   time product_id cost locf_cost
#> 1    1          1    1         1
#> 2    2          1    2         2
#> 3    3          1   NA         2
#> 4    1          2   NA        NA
#> 5    2          2    2         2
#> 6    3          2   NA         2
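
To sanity-check the window result without a Spark session, the same forward fill can be sketched in base R on a local copy of the dummy data. This is only an illustration: `local_df` and `locf_forward` are hypothetical names introduced here, not part of SparkR or zoo.

```r
# Local copy of the dummy data (plain data.frame, no Spark session needed).
local_df <- data.frame(
  time = c(1, 2, 3, 1, 2, 3),
  product_id = c(1, 1, 1, 2, 2, 2),
  cost = c(1, 2, NA, NA, 2, NA)
)

# Hypothetical helper: carry the last non-NA value forward.
# Leading NAs stay NA because there is nothing earlier to carry.
locf_forward <- function(x) {
  idx <- cumsum(!is.na(x))        # position of the latest non-NA value seen so far
  c(NA, x[!is.na(x)])[idx + 1]    # idx == 0 maps to the NA placeholder
}

# Order by product and time, then fill within each product_id.
local_df <- local_df[order(local_df$product_id, local_df$time), ]
local_df$locf_cost <- ave(local_df$cost, local_df$product_id, FUN = locf_forward)
print(local_df)
```

The first row of product 2 remains NA here as well, which is the case the second `na.locf(..., fromLast = TRUE)` call in the question is meant to cover.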
Paul

I was finally able to use SparkR UDFs to perform LOCF with the existing native R code. gapply works for this use case: group the DataFrame on the product_id column and apply the fill within each group.

I have shared my findings here: https://shbhmrzd.medium.com/stl-and-holt-from-r-to-sparkr-1815bacfe1cc
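
A minimal sketch of what the per-group function might look like, assuming the zoo package is installed on the workers and that the data has a `time` column to order by (as in the dummy data from the other answer; the real costTbl may use a different column). `fill_cost` and `one_group` are hypothetical names, and the gapply call is left as a comment because it needs a running Spark session and a schema matching the actual columns.

```r
library(zoo)  # provides na.locf, the function used in the question

# Hypothetical per-group function in the shape gapply expects:
# `key` is the grouping value, `pdf` is a plain R data.frame for one product_id.
fill_cost <- function(key, pdf) {
  pdf <- pdf[order(pdf$time), ]                                  # assumed ordering column
  pdf$cost <- na.locf(pdf$cost, na.rm = FALSE)                   # forward fill
  pdf$cost <- na.locf(pdf$cost, na.rm = FALSE, fromLast = TRUE)  # backward fill for leading NAs
  pdf
}

# Local check on a single group, no Spark session needed.
one_group <- data.frame(time = c(1, 2, 3), product_id = 2, cost = c(NA, 2, NA))
print(fill_cost(list(2), one_group))

# With a SparkDataFrame `df`, the call would look roughly like this (not run here):
# schema <- structType(
#   structField("time", "double"),
#   structField("product_id", "double"),
#   structField("cost", "double")
# )
# filled <- gapply(df, "product_id", fill_cost, schema)
```

Sorting inside the function matters because Spark does not guarantee row order within a group, whereas the data.table version relies on the table's existing order.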

raizsh