I would like to fill down values (replacing NAs with the previous non-missing value) in a sparklyr dataframe; the field holds strings.

In essence I would like to transform this:

ID, String
1    a
2    NaN
3    b
4    NaN

to this:

ID, String
1    a
2    a
3    b
4    b

Is there any way to do this in sparklyr?

hjkop
  • [Relevant](https://github.com/rstudio/sparklyr/issues/1040) – Sotos Jan 18 '19 at 12:11
  • In the linked GitHub issue they solve this for a numeric column using cummax; I'm not sure how to apply that solution to a string column. – hjkop Jan 18 '19 at 12:41
  • Also, I might be reading it wrong, but that solution seems to assume the numbers in the column are ascending, and it simply takes the largest number seen so far as the fill value. – hjkop Jan 18 '19 at 13:43
  • With a high-level API like `sparklyr` there is no straightforward solution that can achieve the expected result and scale out when no grouping is applied. Note that the solution in the linked issue doesn't scale at all, and if you go this way you might as well use a plain R data.frame - see [Avoid performance impact of a single partition mode in Spark window functions](https://stackoverflow.com/q/41313488/6910411) – zero323 Jan 22 '19 at 17:09
  • Is there a way to invoke a function that could achieve this? The problem is that I have to do this on quite a lot of data (and a lot of groups) and it kills R if I try to bring it over to a normal data frame. – hjkop Jan 23 '19 at 23:22
  • It seems doable in pyspark, but I would rather not switch to Python just for this, so really I am looking to translate it to Spark SQL. In the actual frame I am working on there are ID, Y, Q, owner columns, where owner is only filled in when the company for the ID changes owner, and I am trying to fill these forward for each ID across Y, Qs. – hjkop Jan 24 '19 at 16:36
  • in pyspark:

        import sys
        import pyspark.sql.functions as func
        from pyspark.sql.window import Window
        from pyspark.sql.functions import last

        orderCols = ['Y', 'Q']
        window = Window.partitionBy('ID') \
            .orderBy(orderCols) \
            .rowsBetween(-sys.maxsize, 0)
        filled_column = last(df['Owner'], ignorenulls=True).over(window)
        spark_df_filled = df.withColumn('owner_filled', filled_column)
        spark_df_filled.orderBy(orderCols).show(100)

    – hjkop Jan 24 '19 at 16:37
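
A Spark SQL translation of that PySpark snippet, runnable from sparklyr (a sketch; `sc` stands for an existing Spark connection, and `owners` is a hypothetical name for the registered table with the ID, Y, Q, Owner columns described in the comments above):

library(sparklyr)

# LAST(Owner, TRUE) ignores nulls, so within each ID (ordered by Y, Q)
# every row gets the most recent non-null Owner seen so far
filled <- sdf_sql(sc, "
    SELECT *
    , LAST(Owner, TRUE) OVER (PARTITION BY ID ORDER BY Y, Q
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS owner_filled
    FROM owners
")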

2 Answers


A recent package update for sparklyr includes fill, which behaves the same way as the tidyr function of the same name. Alternatively, we can drop down to SQL and apply the LAST window function directly. In the example below I partition by "grp" because I wanted grouping for my own work; just remove the group_by/PARTITION BY to do it without groups.

Create the table

# con is an existing Spark connection, e.g. from sparklyr::spark_connect()
a <- data.frame(
  vals    = c(NA, NA, NA, 1, NA, NA, NA, NA, NA, 6, NA, NA, NA, 10),
  row_num = 1:14,
  grp     = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2)
)
b <- copy_to(con, a, "example", overwrite = TRUE)

Fill down in SQL

c <- sdf_sql(con, "
    SELECT *
    -- LAST(vals, TRUE) returns the last non-null value in the frame;
    -- ORDER BY row_num makes the fill order deterministic
    , LAST(vals, TRUE) OVER (PARTITION BY grp ORDER BY row_num
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS last_val
    FROM example
")

Fill down in sparklyr

Admittedly, I've had some trouble using the fill function, but this is how it should go.

c <- sdf_sql(con, "SELECT * FROM example")
c <- c %>% mutate(last_val = fill(val,.direction="down")

Alternatively, you can inject the SQL directly inside mutate(), which definitely works. Note that without a PARTITION BY the whole table becomes a single window partition, which is the scaling caveat raised in the comments on the question.

c <- sdf_sql(con, "SELECT * FROM example")
c <- c %>% mutate(last_val = sql("LAST LAST(vals, TRUE) OVER  (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)"))

edog429

This worked for your example:

library(sparklyr)
library(tidyverse)

sc <- spark_connect(master = "local")

# NaN is coerced to the string "NaN" when mixed with character values
df <- tibble(ID = c(1, 2, 3, 4), String = c("a", NaN, "b", NaN))

tbl_df <- sdf_copy_to(sc, df, "tbl_df", overwrite = TRUE)

tbl_df <- tbl_df %>%
  mutate(String = ifelse(String == "NaN", lag(String, order_by = ID), String))

> glimpse(tbl_df)
Observations: ??
Variables: 2
$ ID     <dbl> 1, 2, 3, 4
$ String <chr> "a", "a", "b", "b"
Antonios
  • That might work on the exact example used by the OP, but it doesn't generalize, and more importantly, doesn't scale. – zero323 Jan 22 '19 at 17:10
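
To see why it doesn't generalize: lag() reads the original column, not the partially filled one, so with two consecutive missing values the second one stays unfilled. A minimal counterexample (a sketch, reusing the libraries and sc connection from the answer):

df2 <- tibble(ID = c(1, 2, 3, 4), String = c("a", NaN, NaN, "b"))
tbl_df2 <- sdf_copy_to(sc, df2, "tbl_df2", overwrite = TRUE)

# row 3 gets lag(String) = "NaN" (row 2's original value), so it is never filled
tbl_df2 %>%
  mutate(String = ifelse(String == "NaN", lag(String, order_by = ID), String)) %>%
  glimpse()
# $ String <chr> "a", "a", "NaN", "b"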