
I've been struggling with this task for quite some time, so I thought I'd ask for help.

I'm trying to add a new column to df2 based on information in both df1 and df2: whenever the locations in the two data frames match and the time stamp in df2 falls within the interval given by df1, the new column should contain the ID from df1; otherwise it should be 0. The issue is that the data frames are of unequal length. I know how to write a nested for loop, but it's ugly and takes forever to run. I tried using sapply, as proposed as a solution to a similar issue, but it won't run due to the dissimilar data frame lengths.

I found the thread [Speed up the loop operation in R], but because the data frames on which the conditionals act are of different lengths, I can't get that solution to work.

This is my data:

df1 <- structure(list(ID = c(NA, NA, 10035010L), location = c("barge", 
"barge", "barge"), start = structure(c(NA, NA, 
1427301960), class = c("POSIXct", "POSIXt"), tzone = ""), end = structure(c(NA, 
NA, 1437418440), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("ID", 
"location", "start", "end"), row.names = c(NA, 3L), class = "data.frame")

df2<-structure(list(time = structure(c(1419062220, 1419063120, 1427325120, 
1427325240, 1427325360, 1427325540, 1427325660, 1427326680, 1427568960, 
1427569320, 1427569500), class = c("POSIXct", "POSIXt"), tzone = ""), 
    location = c("barge", "barge", "barge", 
    "barge", "barge", "barge", "barge", 
    "barge", "barge", "barge", "barge"
    )), row.names = c(222195L, 222196L, 186883L, 186884L, 186885L, 
186886L, 186887L, 186888L, 186930L, 186931L, 186932L), class = "data.frame", .Names = c("time", 
"location"))

UPDATE: I decided to go with the dplyr package, as I feel comfortable working with it, and used it on my larger dataset. However, a problem arises: the output is not consistent across locations when I include the station ID.

Consider the same but slightly modified datasets, now including station, to see the difference in results:

df3<-structure(list(time = structure(c(1419061860, 1419062220, 1419063120, 
1427325120, 1427325240, 1427325360, 1427325540, 1427325660, 1427326680, 
1427568960, 1427569320), class = c("POSIXct", "POSIXt"), tzone = ""), 
    station = c(104667L, 104667L, 104667L, 124083L, 124083L, 
    124083L, 124083L, 124083L, 124083L, 124083L, 124083L), location = c("barge", 
    "barge", "barge", "barge", "barge", 
    "barge", "barge", "barge", "barge", 
    "barge", "barge")), row.names = 879:889, class = "data.frame", .Names = c("time", "station", "location"))

and

df4<-structure(list(station = c(124083L, 113071L), location = c("barge", 
"barge"), ID = c(10035010L, NA), start = structure(c(1427301960, 
NA), class = c("POSIXct", "POSIXt"), tzone = ""), end = structure(c(1437418440, 
NA), class = c("POSIXct", "POSIXt"), tzone = "")), row.names = 3:4, class = "data.frame", .Names = c("station", 
"location", "ID", "start", "end"))

when I run the dplyr solution,

df3 %>% left_join(., df4) %>%
  mutate(ID = ifelse(time >= start & time < end, ID, 0))

it doesn't return the same kind of output: in the first case the returned dataset is a multiple of the original data, while in the second case it is the same length as the original. I just can't figure out why they differ, and it makes using the filter() function impossible. Any suggestions on how to tackle this problem would be highly appreciated. Thanks.

FlyingDutch

3 Answers


I just used some old-fashioned SQL code for a similar problem the other day. Try this:

library(sqldf)

sqldf('
SELECT
  df2.*
  ,CASE WHEN df1.location IS NOT NULL THEN 1 ELSE 0 END AS id
FROM df2
LEFT JOIN df1
  ON  df2.location = df1.location
  AND df2.time > df1.start
  AND df2.time < df1.end
')

If you are doing this on a large dataset, I would avoid the dplyr code above, as the join goes cartesian before the filter removes the unnecessary rows. I'm hoping that someone adds conditional joins to dplyr soon.
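For completeness, a variant of the same sqldf query that returns the actual ID (with 0 when there is no match) rather than a 0/1 flag could look like the sketch below; it follows the structure of the query above and has not been tested beyond the sample data:

```r
library(sqldf)

# COALESCE turns the NULL produced by an unmatched LEFT JOIN into 0,
# so non-matching rows of df2 get ID = 0 as the question asks.
sqldf('
SELECT
  df2.*
  ,COALESCE(df1.ID, 0) AS ID
FROM df2
LEFT JOIN df1
  ON  df2.location = df1.location
  AND df2.time >= df1.start
  AND df2.time <= df1.end
')
```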

JackStat
  • This looks like an elegant solution to my problem. I am completely unaware of the sqldf package or SQL coding, but I will have a look at the vignette. Thanks! – FlyingDutch Feb 15 '16 at 14:20

You can use dplyr to join the two data frames and mutate as follows:

library(dplyr)
df2 %>% left_join(., df1) %>%
  mutate(ID = ifelse(time > start & time < end, 1, 0))

Output will be as follows (you can filter out the rows with NA if you like):

                  time location ID               start                 end
1  2014-12-20 02:57:00    barge NA                <NA>                <NA>
2  2014-12-20 02:57:00    barge NA                <NA>                <NA>
3  2014-12-20 02:57:00    barge  0 2015-03-25 12:46:00 2015-07-20 14:54:00
4  2014-12-20 03:12:00    barge NA                <NA>                <NA>
5  2014-12-20 03:12:00    barge NA                <NA>                <NA>
6  2014-12-20 03:12:00    barge  0 2015-03-25 12:46:00 2015-07-20 14:54:00
7  2015-03-25 19:12:00    barge NA                <NA>                <NA>
8  2015-03-25 19:12:00    barge NA                <NA>                <NA>
9  2015-03-25 19:12:00    barge  1 2015-03-25 12:46:00 2015-07-20 14:54:00
10 2015-03-25 19:14:00    barge NA                <NA>                <NA>
11 2015-03-25 19:14:00    barge NA                <NA>                <NA>
12 2015-03-25 19:14:00    barge  1 2015-03-25 12:46:00 2015-07-20 14:54:00
13 2015-03-25 19:16:00    barge NA                <NA>                <NA>
14 2015-03-25 19:16:00    barge NA                <NA>                <NA>
15 2015-03-25 19:16:00    barge  1 2015-03-25 12:46:00 2015-07-20 14:54:00
16 2015-03-25 19:19:00    barge NA                <NA>                <NA>
17 2015-03-25 19:19:00    barge NA                <NA>                <NA>
18 2015-03-25 19:19:00    barge  1 2015-03-25 12:46:00 2015-07-20 14:54:00
19 2015-03-25 19:21:00    barge NA                <NA>                <NA>
20 2015-03-25 19:21:00    barge NA                <NA>                <NA>
21 2015-03-25 19:21:00    barge  1 2015-03-25 12:46:00 2015-07-20 14:54:00
22 2015-03-25 19:38:00    barge NA                <NA>                <NA>
23 2015-03-25 19:38:00    barge NA                <NA>                <NA>
24 2015-03-25 19:38:00    barge  1 2015-03-25 12:46:00 2015-07-20 14:54:00
25 2015-03-28 14:56:00    barge NA                <NA>                <NA>
26 2015-03-28 14:56:00    barge NA                <NA>                <NA>
27 2015-03-28 14:56:00    barge  1 2015-03-25 12:46:00 2015-07-20 14:54:00
28 2015-03-28 15:02:00    barge NA                <NA>                <NA>
29 2015-03-28 15:02:00    barge NA                <NA>                <NA>
30 2015-03-28 15:02:00    barge  1 2015-03-25 12:46:00 2015-07-20 14:54:00
31 2015-03-28 15:05:00    barge NA                <NA>                <NA>
32 2015-03-28 15:05:00    barge NA                <NA>                <NA>
33 2015-03-28 15:05:00    barge  1 2015-03-25 12:46:00 2015-07-20 14:54:00
Gopala
  • This solution is more intuitive to me, but requires two steps to get the desired result. Thank you. – FlyingDutch Feb 15 '16 at 14:36
  • I found that including na.omit() will do the trick to filter NAs. Awesome! – FlyingDutch Feb 15 '16 at 16:01
  • it seems that when I add station ID as a factor for joining it doesn't create the same kind of output as you provided. I don't know how to remove duplicated time rows. Any suggestions? I updated my OP. Thank you. – FlyingDutch Feb 16 '16 at 16:41
  • That is because joins will join on all common columns and create 'missing' columns. You can specify the column to join by using the `by` argument to `left_join`. In this case, you can specify `location` only so it does not join by station ID. – Gopala Feb 17 '16 at 02:01
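What that last comment describes can be sketched as follows against the df3/df4 data from the question; this is untested beyond the sample data, and note that the station column, now duplicated rather than joined on, comes back with .x/.y suffixes:

```r
library(dplyr)

# Join by location only, so station is NOT used as a join key;
# df3$station and df4$station return as station.x / station.y.
df3 %>%
  left_join(df4, by = "location") %>%
  mutate(ID = ifelse(time >= start & time < end, ID, 0))
```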

You can use outer to apply a function to every combination of elements from two vectors of arbitrary length. It performs only the necessary calculations (i.e., the unique combinations). In your case, you would call outer three times for your logical tests and combine the results into a single logical matrix.

gets_id <- outer(df2$location, df1$location, '==') & 
  outer(df2$time, df1$start, '>=') & 
  outer(df2$time, df1$end, '<=')

This yields the following output. TRUE values indicate that the location matches between data frames and that the time falls between start and end. NA values in the result are due to NA values in start and end.

      [,1] [,2]  [,3]
 [1,]   NA   NA FALSE
 [2,]   NA   NA FALSE
 [3,]   NA   NA  TRUE
 [4,]   NA   NA  TRUE
 [5,]   NA   NA  TRUE
 [6,]   NA   NA  TRUE
 [7,]   NA   NA  TRUE
 [8,]   NA   NA  TRUE
 [9,]   NA   NA  TRUE
[10,]   NA   NA  TRUE
[11,]   NA   NA  TRUE

Once you have your result, you can manipulate it as you like. The following will work for your use case.

assignments <- which(gets_id, arr.ind=TRUE)
df2$id[assignments[,'row']] <- df1$ID[assignments[,'col']]

Resulting in:

                      time location       id
222195 2014-12-20 02:57:00    barge       NA
222196 2014-12-20 03:12:00    barge       NA
186883 2015-03-25 19:12:00    barge 10035010
186884 2015-03-25 19:14:00    barge 10035010
186885 2015-03-25 19:16:00    barge 10035010
186886 2015-03-25 19:19:00    barge 10035010
186887 2015-03-25 19:21:00    barge 10035010
186888 2015-03-25 19:38:00    barge 10035010
186930 2015-03-28 14:56:00    barge 10035010
186931 2015-03-28 15:02:00    barge 10035010
186932 2015-03-28 15:05:00    barge 10035010
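If you want 0 instead of NA for the non-matching rows, as the question asks, one small extension of the code above is to initialise the column before assigning (a sketch, assuming gets_id from above has been computed):

```r
# Initialise to 0, then overwrite only the rows with a match;
# which() treats the NA entries of gets_id as FALSE, so they stay 0.
df2$id <- 0
assignments <- which(gets_id, arr.ind = TRUE)
df2$id[assignments[, 'row']] <- df1$ID[assignments[, 'col']]
```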
attitude_stool