3

I have a dataframe with missing values that I've written a function to fill using R 3.3.2

pkgs <- c("dplyr", "ggplot2", "tidyr", 'data.table', 'lazyeval')
lapply(pkgs, require, character.only = TRUE)

UID <- c('A', 'A', 'A', 'B', 'B', 'B', 'C', 'C')
Col1 <- c(1, 0, 0, 0, 1, 0, 0, 0)
df <- data.frame(UID, Col1)

Function to fill in Col1:

AggregatedColumns <- function(DF, columnToUse, NewCol1) {
  # Setting up column names to use
  columnToUse <- deparse(substitute(columnToUse))
  NewCol1 <- deparse(substitute(NewCol1))

  # Creating new columns 
  DF[[NewCol1]] <- ifelse(DF[[columnToUse]] == 1, 1, NA)
  DF <- DF %>% group_by_("UID") %>% sort(DF[[columnToUse]], decreasing = TRUE) %>% fill_(NewCol1)
  DF <- DF %>% group_by_("UID") %>% sort(DF$columnToUse, decreasing = TRUE) %>% fill_(NewCol1, .direction = 'up')
  DF[[NewCol1]] <- ifelse(is.na(DF[[NewCol1]]), 0, DF[[NewCol1]])

  DF
}

I've pulled out this part of the function since this is the piece that is slowing down the function. I'm very new to writing functions and any advice on how/if this can be sped up would be appreciated. I've isolated the speed issue down to the fill_ part of the function.

What I am trying to do is pass a dummy variable from Col1 to New_Column and then forward fills to other same ID's. For example:

UID             Col1
John Smith        1
John Smith        0

Should become

UID             Col1  New_Column
John Smith        1      1
John Smith        0      1

EDITED FUNCTION I edited the function to fit with @HubertL suggestion. The function is still fairly slow, but hopefully with these edits the example is reproducible.

AggregatedColumns <- function(DF, columnToUse, NewCol1) {
  # Setting up column names to use
  columnToUse <- deparse(substitute(columnToUse))
  NewCol1 <- deparse(substitute(NewCol1))

  # Creating new columns 
  DF[[NewCol1]] <- ifelse(DF[[columnToUse]] == 1, 1, NA)
    DF <- DF %>% group_by_("UID") %>% fill_(NewCol1) %>% fill_(NewCol1, .direction = 'up')
  DF[[NewCol1]] <- ifelse(is.na(DF[[NewCol1]]), 0, DF[[NewCol1]])

  DF
}

Desired output:

UID Col1 New
A    1    1 
A    0    1
A    0    1 
B    0    1
B    1    1
B    0    1
C    0    0
C    0    0
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
vino88
  • 163
  • 13
  • 1
    Can you show your desired output, how this function is being used and etc.? I can't make that work. – David Arenburg May 31 '17 at 21:47
  • 1
    Why don't you just `DF %>% group_by(UID) %>% fill(NewCol1) %>% fill(NewCol1, .direction = 'up')`? – HubertL May 31 '17 at 21:50
  • Thank you @HubertL, silly of me not to think of just piping them all together, though I'm still confused why after sorting just one fill doesn't work, but that's a different question – vino88 May 31 '17 at 22:02
  • sorry about that @DavidArenburg, edited – vino88 May 31 '17 at 22:03
  • If the first entry of col2 should be 0 for B, then `df$Col2 <- ave(df$Col1, df$UID, FUN=cummax)` will work. If you're looking for the max, then replace `cummax` with `max` in `ave`. – lmo May 31 '17 at 22:31
  • It does work if you use arrange to sort by NewCol1 `DF %>% group_by(UID) %>% arrange_(NewCol1, .dots=paste("desc(",NewCol1, ")")) %>% fill_(NewCol1)` but not if you sort by columnToUse, not sure why... – HubertL May 31 '17 at 22:45
  • 1
    I can't run this function. Please show all the packages you are using here. And also explain in words what are you trying to do. Your code could be easily sped up if we could understand what it actually does and how to run it. – David Arenburg Jun 01 '17 at 07:38
  • 1
    I can't run your function either. Getting `Error: Can't use matrix or array for column indexing`. – Axeman Jun 01 '17 at 09:06
  • Instead of piping into `sort`, you probably want to use `arrange_(lazyeval::interp(~desc(x), x = as.name(columnToUse)))`. – Axeman Jun 01 '17 at 09:18
  • Thanks everyone for the help, sorry for the reproducibility issues. i've edited the post to hopefully address these issues. – vino88 Jun 01 '17 at 17:55
  • 1
    Here's an easy and efficient way achieving this without using a single package using a single step `DF[[NewCol1]] <- as.integer(DF$UID %in% DF[DF[[columnToUse]] == 1, "UID"])` – David Arenburg Jun 01 '17 at 19:27

2 Answers2

2

If speed is a concern, you may try this with data.table and na.locf() from the zoo package. LOCF means last observation carried forward.

library(data.table)
setDT(df)[Col1 != 0, New := Col1 ][, New := zoo::na.locf(New), UID][is.na(New), New := 0][]
#   UID Col1 New
#1:   A    1   1
#2:   A    0   1
#3:   A    0   1
#4:   B    0   1
#5:   B    1   1
#6:   B    0   1
#7:   C    0   0
#8:   C    0   0

This is just to give an idea. It still needs to be wrapped in a function call.

It assumes that value 0 in Col1 is considered as missing.

Uwe
  • 41,420
  • 11
  • 90
  • 134
2

First of all, here are few points:

  1. You are needlessly calling ifelse (twice) while this function is very inefficient
  2. You needlessly using inefficient function (by group) from external package (also twice) when you could simply vectorize the process with just base R.

Here's a simple one-liner without using any external packages that enhances performance by a factor of x72 (and probably much more for bigger data sets) on a 5e7 data set

AggregatedColumns2 <- function(DF, columnToUse, NewCol1) {
    # Setting up column names to use
    columnToUse <- deparse(substitute(columnToUse))
    NewCol1 <- deparse(substitute(NewCol1))

    # Creating the new column (one simple line)
    DF[[NewCol1]] <- as.integer(DF$UID %in% DF$UID[DF[[columnToUse]] == 1])

    # returning new data set back
    DF
}

Benchmark

set.seed(123)
library(stringi)
N <- 5e7
UID <- stri_rand_strings(N, 2)
Col1 <- sample(0:1, N, replace = TRUE)
df <- data.frame(UID, Col1)


system.time(res <- AggregatedColumns(df, Col1, NewCol1))
#   user  system elapsed 
# 198.67    3.94  203.07 

system.time(res2 <- AggregatedColumns2(df, Col1, NewCol1))
# user  system elapsed 
# 2.82    0.00    2.82  

Now in order to compare those I will reorder both and convert to a matrix, because Hadleyverses packages add tons of unnecessary attributes (compare the mess created in str(res) vs the simple structure in str(res2))

identical(arrange(res, UID) %>% as.matrix, arrange(res2, UID) %>% as.matrix)
## [1] TRUE
David Arenburg
  • 91,361
  • 17
  • 137
  • 196