0

I start off by building an empty data frame:

results <- data.frame(ID=numeric(0), StartDate=numeric(0), term_type=character(0), EndDate=numeric(0), stringsAsFactors = FALSE)

I then have a list of unique ID numbers: uniqueIds <- c(1234, 4566, 7838)

I have a function getDataForGivenId that produces a data frame in the format:

ID, StartDate, term_type, EndDate

I need the code to call the function getDataForGivenId for each ID and append the resulting data frame to the empty dataframe results.

I have tried:

library(dplyr)

results <- bind_rows(results, (lapply(uniqueIds, getDataForGivenId)))

and

do.call("rbind", lapply(uniqueIds, getDataForGivenId))

and

for (Id in uniqueIds) {
    Y <- getDataForGivenId(Id)
    results <- rbind(results, Y) 
}

Everytime I just end up with an empty results dataframe.

Note that if I take things out of the loop and simply execute the code:

Y <- getDataForGivenId(1234)
results <- rbind(results, Y)

I get the output I expect.

Does anyone know what I'm doing wrong?

EDIT -- My full script is below.

library(dplyr)
library(lubridate)

enVariables <- Sys.getenv()
username    <- enVariables[["DB_USERNAME"]]
password    <- enVariables[["DB_PASSWORD"]]

results <- data.frame(ID=numeric(0), StartDate=numeric(0), term_type=character(0), EndDate=numeric(0), stringsAsFactors = FALSE)

getConnection <- function(){
  require(RMySQL)
  username <- username
  password <- password 
  con <- dbConnect(
    MySQL(), user=username, password=password, 
    dbname='database', host='host', port=port
  )
  return(con)
}

queryuniqueIds <- "SELECT DISTINCT(id) FROM table LIMIT 5"
con                <- getConnection()
uniqueIds      <- dbGetQuery(con, queryuniqueIds)
dbDisconnect(con)

getDataForGivenID <- function(idNumber) {
  queryData <- paste0(
    "SELECT ",
    "Id, bill_date, bill_hour ",
    "FROM table ",
    "WHERE id = ", idNumber
  ) 
  con <- getConnection()
  Data <- dbGetQuery(con, queryData)
  dbDisconnect(con)
  X <- Data %>% 
    select(ID, bill_date, bill_hour) %>%
    mutate(
      bill_date_x = ymd_hms(bill_date)
    ) %>% 
    arrange(ID, bill_date, bill_hour)
  hour(X$bill_date_x) <- X$bill_hour
  X <- X %>% 
    mutate(
      lag_x     = lag(bill_date_x, 1),
      lag_diff  = difftime(bill_date_x,lag_x, units = "hours") %>% as.integer(),
      lead_x    = lead(bill_date_x, 1),
      lead_diff = difftime(lead_x, bill_date_x, units = "hours") %>% as.integer()
    )
  Y <- X %>% 
    filter(
      is.na(lag_diff) | 
        is.na(lead_diff) | 
        !(lag_diff == 1 & lead_diff == 1),

      is.na(lag_diff) | 
        is.na(lead_diff) | 
        !(lag_diff == 0 | lead_diff == 0)
    ) %>% 
    mutate(
      term_type = "N",
      term_type = replace(term_type, lead_diff == 1, "S"),
      term_type = replace(term_type, lag_diff  == 1, "E")
    )
  Y <- Y %>%
    select(ID, bill_date_x, term_type) %>% 
    mutate(
      lead_date = lead(bill_date_x, 1)
    )  %>% 
    filter(term_type == "S")
  colnames(Y) <- c("ID", "StartDate", "term_type", "EndDate")
  return(Y)
}

do.call("rbind", lapply(uniqueIds, getDataForGivenID))
View(results)
OnlyDean
  • 1,025
  • 1
  • 13
  • 25
  • The first thing I would check is whether `getDataForGivenId` is actually returning non-empty data frames. – joran Jun 25 '18 at 15:41
  • It is returning populated DataFrames, I checked that. – OnlyDean Jun 25 '18 at 15:42
  • In that case something else is going on that we probably can't help with or guess at without a more reproducible example. Because _if_ `lapply` is actually returning a list of non-empty data frames, the `do.call` line _will_ work, which suggests to me that there is something else going on that we don't know about. – joran Jun 25 '18 at 15:45
  • When asking for help, you should include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Jun 25 '18 at 15:46
  • so I assume `getDataForGivenId` is a list of data frames ? – YOLO Jun 25 '18 at 15:51
  • @YOLO no, `getDataForGivenId` is a function the returns a Data Frame and take an ID as a argument. – OnlyDean Jun 25 '18 at 16:02
  • @joran I have uploaded my entire script. – OnlyDean Jun 25 '18 at 16:09
  • @MrFlick I have uploaded my entire script – OnlyDean Jun 25 '18 at 16:09
  • 1
    This is gonna be tough to help with. We still basically have to take your word that `lapply(uniqueIds, getDataForGivenID)` successfully returns a list of non-empty data frames. But if I simply create a list of non-empty data frames with the columns you specify, `do.call("rbind",...)` works perfectly for me every time when I test it. Part of the "reproducible" bit here is trying to help you walk through the debugging process. Run `lapply(uniqueIds, getDataForGivenID)` and share the actual output from that via `dput`. – joran Jun 25 '18 at 16:17

1 Answers1

1

I finally figured out my problem.

The list uniqueIds was of length 1. R was passing in the entire list as once, causing the SQL statement to only return the data for the first id.

I changed

uniqueIds <- dbGetQuery(con, queryuniqueIds) 

to

uniqueIds <- as.data.frame(dbGetQuery(con, queryuniqueIds))

and

do.call("rbind", lapply(uniqueIds, getDataForGivenId))

to

results <- do.call("rbind", lapply(uniqueIds$id, getDataForGivenId))

Things now work as expected. Thank you to those who helped.

OnlyDean
  • 1,025
  • 1
  • 13
  • 25