R - How to vectorize with apply family function and avoid while/for loops in this case?

Question

In this case (more details could be found in this question: Count how many observations in the rest of the dat fits multiple conditions? (R))

This is a dataset called event, containing thousands of events (observations) and I selected several rows to show you the data structure. It contains the "STATEid", "date" of occurrence, and geographical coordinates in two variables "LON" "LAT". I am writing to calculate a new variable (column) for each row. This new variable should be: "Given any specific incident, count the rest of the dataset and calculate the number of events that's happened in the same state, within the circle of 50/100KM radius, in the next 30/60 days.

I have written a user-defined function with a while loop - to make this easier, I only included 2 conditions, within 30 days and in the same state:

n = 1

f = function(i) {
  a = i[n,]
  b = a$date
  # c = a$LON
  # d = a$LAT
  e = a$STATEid
  f = a$RID
  g1 = sum(i$CASE  [i$date<= b+30 & i$date>b & i$STATEid==e], na.rm=T)
  # g2 = sum(i$viold [i$date<= b+30 & i$date>b], na.rm=T)
  # g3 = sum(i$CASE  [i$date<= b+60 & i$date>b], na.rm=T)
  # g4 = sum(i$viold [i$date<= b+60 & i$date>b], na.rm=T)
  # h = cbind(g1, g2, g3, g4)
  g1 = data.frame(g1)
  n = n+1
  assign(as.character(f), g1, envir = .GlobalEnv)
}

for (n in 1:20) (f(event2))

This has been taking too long since it contains 23,000 cases. My PC with a 16GB Ram cannot nail it when the loop only needs to run twice! So I think avoiding loops would be preferable. Could you please suggest a way to vectorize my codes and avoid the looping?

My main problem is that I don't know how to write user-defined problems when I need to refer to each row, each variable properly when multiple conditions are required - that's why in my loop function, I created objects such as "a", "b", "c", "d", "e" to call them properly... Inefficient - I know...

My dput outcome looks like so:

     > dput(tail(event2[,c("RID", "STATEid", "date", "LON", "LAT")]))
structure(list(RID = c("023610", "023611", "023613", "023614", 
"023615", "023616"), STATEid = structure(c(36L, 36L, 23L, 23L, 
5L, 14L), .Label = c("alabama", "alaska", "arizona", "arkansas", 
"california", "colorado", "connecticut", "delaware", "district of columbia", 
"florida", "georgia", "hawaii", "idaho", "illinois", "indiana", 
"iowa", "kansas", "kentucky", "louisiana", "maine", "maryland", 
"massachusetts", "michigan", "minnesota", "mississippi", "missouri", 
"montana", "nebraska", "nevada", "new hampshire", "new jersey", 
"new mexico", "new york", "north carolina", "north dakota", "ohio", 
"oklahoma", "oregon", "pennsylvania", "rhode island", "south carolina", 
"south dakota", "tennessee", "texas", "utah", "vermont", "virginia", 
"washington", "west virginia", "wisconsin", "wyoming"), class = "factor"), 
    date = structure(c(3620, -633, 131, -315, 5421, 3558), class = "Date"), 
    LON = c(-80.6495194, -80.6495194, -83.6129939, -83.6129939, 
    -121.6169108, -87.8328505), LAT = c(41.0997803, 41.0997803, 
    42.2411499, 42.2411499, 39.1404477, 42.4461322)), .Names = c("RID", 
"STATEid", "date", "LON", "LAT"), row.names = c(23610L, 23611L, 
23613L, 23614L, 23615L, 23616L), class = "data.frame")
>

Many thanks. I appreciate your help.

Best,

---------- Jan.20, 2018 updates ---------

I have created a loop that works and correctly reflects what I am hoping for:

g = event2[FALSE,]

USERFUN = function(i) {
  a = i[n,] # retrieve each row from the object, make it a data object
  b = a$date # get date
  # c = a$LON # for now I dropped the idea of calculating radius
  # d = a$LAT # for now I dropped the idea of calculating radius
  e = a$STATEid # get STATE
  f = a$RID # get case ID to name the objects generated!

  PostAct30 = sum(i$CASE [i$date<= b+30 & i$date>b & i$STATEid == e], na.rm=T) # multiple conditions defined here - i is the entire dataset 
  PostAct60 = sum(i$CASE [i$date<= b+60 & i$date>b & i$STATEid == e], na.rm=T) # multiple conditions defined here - b, e are dynamic, retrieving from each line!!!
  PreAct30 = sum(i$CASE [i$date<= b & i$date>b-30 & i$STATEid == e], na.rm=T)
  PreAct60 = sum(i$CASE [i$date<= b & i$date>b-30 & i$STATEid == e], na.rm=T)
  PostVio30 = sum(i$viold [i$date<= b+30 & i$date>b & i$STATEid == e], na.rm=T)
  PostVio60 = sum(i$viold [i$date<= b+60 & i$date>b & i$STATEid == e], na.rm=T)
  PreVio30 = sum(i$viold [i$date<= b & i$date>b-30 & i$STATEid == e], na.rm=T)
  PreVio60 = sum(i$viold [i$date<= b & i$date>b-30 & i$STATEid == e], na.rm=T)
  g1 = data.frame(f, PostAct30, PostAct60, PreAct30, PreAct60, PostVio30, PostVio60, PreVio30, PreVio60)
  n = n+1
  return(g1)
  }
# sum(event2$ca)
n = 1
for (n in 1:19446) {
  g2 = USERFUN(event2)
  g = rbind(g, g2)        
}

AND the output looks like so:

> tail(event3 [c("date","STATEid", "PostAct30", "PostAct60", "PostVio30", "PostVio60")])
            date    STATEid PostAct30 PostAct60 PostVio30 PostVio60
23611 1968-04-08       ohio         3         4         0         0
23612       <NA>    arizona        NA        NA        NA        NA
23613 1970-05-12   michigan         4         6         2         4
23614 1969-02-20   michigan         2         3         1         1
23615 1984-11-04 california         4         5         0         0
23616 1979-09-29   illinois         0         2         0         1

Your code is hanging because there is an infinite loop. Inside the function, n = n + 1 will not modify the value of n outside the function. The <<- operator will modify n the way you want. — Paul, Jan 20 '18 at 02:32
Thanks so much! I will modify it to a for loop in the main thread... However, the rest of the problem remains a concern for me... — Tony Chang, Jan 20 '18 at 02:37
You have `f = a$RID` but there's no `RID` column in your data. — eipi10, Jan 20 '18 at 07:38
Can you post what output you're hoping for if the function works as desired? Right now, even if the function is set up to produce a value for `f`, all six rows of `event` produce `g=0`. — eipi10, Jan 20 '18 at 07:43
@eipi10 Hi thanks so much for your help here. I have written a function with a loop that works. I will update the main thread to reflect my latest updates. Though it is a loop (not with apply family), it is something that works, and I am happy about it :-) Again, thanks! — Tony Chang, Jan 20 '18 at 22:13
Thanks for all your helpful comments above. Dear @Eipi10, I think when it generates 0, it is probably due to the fact that when the sample is small, the actual count of each cell is indeed zero. When the function works on my 23,000 cases dataset, the count looks more reasonable, ranging from a few cases to some hundreds. Best, T — Tony Chang, Jan 20 '18 at 22:21

score 0 · Accepted Answer · answered Jan 21 '18 at 01:12

Consider mapply to add new columns in place by iterating date and STATEid elementwise into a defined function. Specifically, mapply produces a matrix of 7 columns which you assign to event2.

dates_calc_fct <- function(b, e) 
  c(sum(event2$CASE [event2$date<= b+30 & event2$date>b & event2$STATEid == e], na.rm=T),
    sum(event2$CASE [event2$date<= b+60 & event2$date>b & event2$STATEid == e], na.rm=T),
    sum(event2$CASE [event2$date<= b & event2$date>b-30 & event2$STATEid == e], na.rm=T),
    sum(event2$CASE [event2$date<= b & event2$date>b-30 & event2$STATEid == e], na.rm=T),
    sum(event2$viold [event2$date<= b+30 & event2$date>b & event2$STATEid == e], na.rm=T),
    sum(event2$viold [event2$date<= b+60 & event2$date>b & event2$STATEid == e], na.rm=T),
    sum(event2$viold [event2$date<= b & event2$date>b-30 & event2$STATEid == e], na.rm=T),
    sum(event2$viold [event2$date<= b & event2$date>b-30 & event2$STATEid == e], na.rm=T)
   )

event2[c("PostAct30", "PostAct60", 
         "PreAct30", "PreAct60",
         "PostVio30", "PostVio60", 
         "PreVio30", "PreVio60")] <- mapply(dates_calc_fct, event$date, event$STATEid)

Thanks a lot, Parfait! This is exactly what I need to learn - mapply and avoid looping. It is way more efficient than my approach. — Tony Chang, Jan 21 '18 at 20:00
LOL - I am new, so my best guess would be upvoting. (However, a new user's vote may not appear until my reputation goes up, some day). Thanks! — Tony Chang, Jan 21 '18 at 23:29
You got it reversed. Selecting an answer and upvoting are two different things. You can select *any* answer on **your** post but cannot upvote until 50 reps. Since this is your post, you can select the best answer that was most helpful (simply check gray checkmark to side) regardless of rep. Read the link above again. Doing so also confirms resolution of this question and informs future readers that my answer worked. — Parfait, Jan 22 '18 at 02:31

R - How to vectorize with apply family function and avoid while/for loops in this case?

1 Answers1

Linked