I've been struggling with this task for quite some time, so I thought I'd ask for your help.
I'm trying to add a new column to df2 based on information in both df1 and df2. An ID column should be created in df2 that holds the ID from df1 whenever the locations in both data frames match and the time stamp in df2 falls within the interval given by start and end in df1; otherwise it should be 0.
The issue is that the two data frames have different numbers of rows. I know how to write a nested for loop, but it's ugly and takes forever to run. I tried using sapply, as proposed in a solution to a similar issue, but it won't run because the data frames differ in length.
I found the thread "Speed up the loop operation in R", but because the data frames that my conditionals act on are of different lengths, I can't get that solution to work.
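To make the intended rule concrete, a minimal base R sketch on made-up toy data might look like this. It loops only over the rows of the interval table and compares each interval against all observations at once, so there is no nested loop. `assign_id`, `obs` and `intervals` are hypothetical names chosen for illustration; the column names (time, location, ID, start, end) are assumed to match df2 and df1.

```r
# Hypothetical helper: returns one ID per row of `obs` (0 when nothing matches).
# Assumes `obs` has columns time and location, and `intervals` has
# columns ID, location, start and end.
assign_id <- function(obs, intervals) {
  id <- rep(0L, nrow(obs))
  # Interval rows with a missing ID or missing bounds can never match.
  ok <- !is.na(intervals$ID) & !is.na(intervals$start) & !is.na(intervals$end)
  for (i in which(ok)) {
    hit <- obs$location == intervals$location[i] &
      obs$time >= intervals$start[i] &
      obs$time < intervals$end[i]
    id[which(hit)] <- intervals$ID[i]
  }
  id
}

# Toy data (made up for illustration, not the real dataset)
intervals <- data.frame(
  ID = c(NA, 7L),
  location = c("barge", "barge"),
  start = as.POSIXct(c(NA, "2015-03-25 12:00:00"), tz = "UTC"),
  end = as.POSIXct(c(NA, "2015-07-20 12:00:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)
obs <- data.frame(
  time = as.POSIXct(c("2014-12-20 08:00:00", "2015-03-26 08:00:00"), tz = "UTC"),
  location = c("barge", "barge"),
  stringsAsFactors = FALSE
)

obs$ID <- assign_id(obs, intervals)
obs$ID  # 0 7
```

The result always has exactly one value per row of `obs`, regardless of how many rows `intervals` has. If two intervals overlapped, the later row would win; whether that is the desired tie-breaking is an open assumption.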
This is my data:
df1 <- structure(list(ID = c(NA, NA, 10035010L), location = c("barge",
"barge", "barge"), start = structure(c(NA, NA,
1427301960), class = c("POSIXct", "POSIXt"), tzone = ""), end = structure(c(NA,
NA, 1437418440), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("ID",
"location", "start", "end"), row.names = c(NA, 3L), class = "data.frame")
df2<-structure(list(time = structure(c(1419062220, 1419063120, 1427325120,
1427325240, 1427325360, 1427325540, 1427325660, 1427326680, 1427568960,
1427569320, 1427569500), class = c("POSIXct", "POSIXt"), tzone = ""),
location = c("barge", "barge", "barge",
"barge", "barge", "barge", "barge",
"barge", "barge", "barge", "barge"
)), row.names = c(222195L, 222196L, 186883L, 186884L, 186885L,
186886L, 186887L, 186888L, 186930L, 186931L, 186932L), class = "data.frame", .Names = c("time",
"location"))
UPDATE: I decided to go with the dplyr package as I feel comfortable working with it, and used it on my larger dataset. However, a problem arises because the output is not consistent across locations when I include station ID.
Consider these slightly modified versions of the same datasets, which now include a station column, to see the difference in results:
df3<-structure(list(time = structure(c(1419061860, 1419062220, 1419063120,
1427325120, 1427325240, 1427325360, 1427325540, 1427325660, 1427326680,
1427568960, 1427569320), class = c("POSIXct", "POSIXt"), tzone = ""),
station = c(104667L, 104667L, 104667L, 124083L, 124083L,
124083L, 124083L, 124083L, 124083L, 124083L, 124083L), location = c("barge",
"barge", "barge", "barge", "barge",
"barge", "barge", "barge", "barge",
"barge", "barge")), row.names = 879:889, class = "data.frame", .Names = c("time", "station", "location"))
and
df4<-structure(list(station = c(124083L, 113071L), location = c("barge",
"barge"), ID = c(10035010L, NA), start = structure(c(1427301960,
NA), class = c("POSIXct", "POSIXt"), tzone = ""), end = structure(c(1437418440,
NA), class = c("POSIXct", "POSIXt"), tzone = "")), row.names = 3:4, class = "data.frame", .Names = c("station",
"location", "ID", "start", "end"))
When I run the dplyr solution,
library(dplyr)
df3 %>% left_join(., df4) %>%
  mutate(ID = ifelse(time >= start & time < end, ID, 0))
it doesn't return the same kind of output: in the first case (df2 and df1) the returned dataset has a multiple of the original number of rows, while in the last case (df3 and df4) it has the same number of rows as the original. I just can't figure out why it's different, and it makes using the filter() function impossible. Any suggestions on how to tackle this problem would be highly appreciated. Thanks.
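A possible cause, sketched on made-up toy data: when no `by` is given, the join key is the set of columns the two tables share. That is location alone in the first case, and station plus location in the second. The sketch below uses base R's `merge()` with `all.x = TRUE` as a stand-in for `left_join()` so that it runs without extra packages; the tables and values are invented, and whether this fully explains the larger dataset is an assumption.

```r
# Toy tables (made up): three observations, two interval rows.
obs <- data.frame(
  station = c(1L, 1L, 2L),
  location = "barge",
  time = 1:3,
  stringsAsFactors = FALSE
)
intervals <- data.frame(
  station = c(2L, 3L),
  location = "barge",
  ID = c(7L, NA),
  stringsAsFactors = FALSE
)

# Key on location only: every obs row pairs with every interval row
# that has the same location, so rows are multiplied.
by_location <- merge(obs, intervals[c("location", "ID")],
                     by = "location", all.x = TRUE)

# Key on station and location: the key is unique in `intervals`,
# so each obs row pairs with at most one interval row.
by_station <- merge(obs, intervals,
                    by = c("station", "location"), all.x = TRUE)

nrow(by_location)  # 6
nrow(by_station)   # 3
```

With a location-only key, one way to get back to one row per observation would be to collapse the joined result per observation after computing the ID, rather than relying on the join to keep the row count.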