Hi friends,

I am new to R programming. I have been trying to write a user-defined function for days but have not nailed it yet. I have a dataset called event, containing thousands of events (observations); I selected several rows to show you the data structure. It contains the "STATEid", the "date" of occurrence, and the geographical coordinates in two variables, "LON" and "LAT".

I am trying to calculate a new variable (column) for each row. This new variable should be: "Given any specific incident, look at the rest of the dataset and count the number of events that happened in the same state, within a circle of 50/100 km radius, in the next 30/60 days."

tail(event[,c("STATEid", "date", "LON", "LAT")])
         STATEid       date        LON      LAT
23611       ohio 1968-04-08  -80.64952 41.09978
23612    arizona       <NA> -112.00000 33.00000
23613   michigan 1970-05-12  -83.61299 42.24115
23614   michigan 1969-02-20  -83.61299 42.24115
23615 california 1984-11-04 -121.61691 39.14045
23616   illinois 1979-09-29  -87.83285 42.44613

I have written some functions like the ones below,

PostVio30 <- function(x) sum(event$viold[event$date <= x + 30 & event$date > x], na.rm = TRUE)
PostAct60 <- function(x) sum(event$CASE[event$date <= x + 60 & event$date > x], na.rm = TRUE)
PostVio60 <- function(x) sum(event$viold[event$date <= x + 60 & event$date > x], na.rm = TRUE)

but they do not calculate the value dynamically for each row.

The result is correct when I enter a specific date and state. For example, when I enter "Alabama" and "1966-1-1", it correctly tells me that 22 incidents occurred in the next 60 days. But how do I lapply/sapply/mapply it to each row so that it is calculated for every event? And how can I avoid manually entering the date/state information, please?

> POSTCOUNTING = function(ANYDATE, DATASET, N) {
+   sum(DATASET$CASE[DATASET$date <= ANYDATE + N & DATASET$date > ANYDATE], na.rm = TRUE)
+ }
> PRECOUNTING = function(ANYDATE, DATASET, N) {
+   sum(DATASET$CASE[DATASET$date < ANYDATE & DATASET$date >= ANYDATE - N], na.rm = TRUE)
+ }
> POSTCOUNTING(as.Date("1966-1-1"), X$alabama, 60)
[1] 22
> PRECOUNTING(as.Date("1966-1-1"), X$alabama, 60)
[1] 9
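
One possible way to run this kind of count for every row, without typing the date or state by hand, is to loop over row indices with vapply and take both values from the row itself. This is only a sketch: the toy data, the helper name post_count, and the assumption that CASE is a numeric 0/1 (or count) column are illustrative and not taken from the real dataset.

```r
# Toy data (hypothetical); CASE is assumed to be a numeric indicator/count.
event <- data.frame(
  STATEid = c("alabama", "alabama", "alabama", "ohio", "ohio"),
  date    = as.Date(c("1966-01-01", "1966-01-15", "1966-03-20",
                      "1966-01-10", NA)),
  CASE    = c(1, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Sum CASE over events in the same state dated within (date_i, date_i + n].
# Hypothetical helper; returns NA when the row's own date is missing.
post_count <- function(i, data, n) {
  d <- data$date[i]
  if (is.na(d)) return(NA_real_)
  in_window <- which(data$STATEid == data$STATEid[i] &
                     data$date > d & data$date <= d + n)
  sum(data$CASE[in_window], na.rm = TRUE)
}

# One value per row: the date and state come from the row being processed.
event$PostAct60 <- vapply(seq_len(nrow(event)), post_count, numeric(1),
                          data = event, n = 60)
print(event)
```

Iterating over indices rather than over the date column keeps the Date class intact and makes it easy to pull several columns (date, state, coordinates) from the same row.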

Alternatively, I have tried to make writing the function easier, with fewer conditions. For example, I tried to avoid writing conditions on "STATEid" by splitting the data first:

X <- split(event, event$STATEid)
PostVio30 <- function(x) sum(event$viold[event$date <= x + 30 & event$date > x], na.rm = TRUE)
X2 <- lapply(X, function(i) {i$PostVio30 <- sapply(i$date, PostVio30)})
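
Two things in the split/lapply attempt above look like they could cause trouble: PostVio30 still reads from the full event data frame rather than from the per-state piece, and the anonymous function returns only the assigned column instead of the modified data frame. A hedged sketch of one way around both, using made-up data and a hypothetical helper post_vio that takes the per-state subset as an argument:

```r
# Toy data (hypothetical values); viold is assumed to be numeric.
event <- data.frame(
  STATEid = c("ohio", "alabama", "alabama", "ohio", "alabama"),
  date    = as.Date(c("1966-01-05", "1966-01-01", "1966-01-20",
                      "1966-01-25", "1966-04-01")),
  viold   = c(1, 0, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Sum viold for rows of `data` dated within (d, d + n].
post_vio <- function(d, data, n = 30) {
  in_window <- which(data$date > d & data$date <= d + n)
  sum(data$viold[in_window], na.rm = TRUE)
}

by_state <- split(event, event$STATEid)
by_state <- lapply(by_state, function(df) {
  # Count only within this state's rows.
  df$PostVio30 <- vapply(seq_len(nrow(df)),
                         function(i) post_vio(df$date[i], df),
                         numeric(1))
  df  # return the whole data frame, not just the new column
})

# Reassemble the pieces in the original row order.
event2 <- unsplit(by_state, event$STATEid)
print(event2)
```

unsplit is used here so the rows come back in their original order; do.call(rbind, by_state) would also work but groups the rows by state.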

So I am here to learn from your wisdom. If you want, I can share the data as a reproducible file.

Also, the geographical distance calculation is somewhat tricky for me as well. This page uses a function called gdist; would that be a plausible option?

(Loop over a data.table rows with condition)

locations[, if (gdist(-159.58, 21.901, location_lon, location_lat, units="m") <= 50) .SD, id]
##    id location_lon location_lat
## 1: 11      -159.58       21.901
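
The gdist call quoted above appears to come from an add-on package (possibly Imap; the post does not say). As a dependency-free alternative, a great-circle distance can be sketched in base R with the haversine formula. The function name haversine_km and the mean Earth radius of 6371 km are assumptions made for illustration; a spherical model is only an approximation, though it should be adequate for 50/100 km thresholds.

```r
# Haversine great-circle distance in kilometres (spherical Earth assumed).
# Vectorized: any argument may be a vector; usual R recycling applies.
haversine_km <- function(lon1, lat1, lon2, lat2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin guards against rounding pushing sqrt(a) slightly above 1.
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Distance between the Ohio and Michigan points from the sample rows.
d_oh_mi <- haversine_km(-80.64952, 41.09978, -83.61299, 42.24115)
print(d_oh_mi)
```

Because the function is vectorized, passing one event's coordinates together with the full LON and LAT columns returns the distance from that event to every other event in one call.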

Thanks so much.

[Replying to another thread: Yes - the coordinates can vary within a state. The incidents could happen in different towns.]

My dput output looks like this:

> dput(tail(event[,c("STATEid", "date", "LON", "LAT")]))
structure(list(STATEid = structure(c(36L, 3L, 23L, 23L, 5L, 14L
), .Label = c("alabama", "alaska", "arizona", "arkansas", "california", 
"colorado", "connecticut", "delaware", "district of columbia", 
"florida", "georgia", "hawaii", "idaho", "illinois", "indiana", 
"iowa", "kansas", "kentucky", "louisiana", "maine", "maryland", 
"massachusetts", "michigan", "minnesota", "mississippi", "missouri", 
"montana", "nebraska", "nevada", "new hampshire", "new jersey", 
"new mexico", "new york", "north carolina", "north dakota", "ohio", 
"oklahoma", "oregon", "pennsylvania", "rhode island", "south carolina", 
"south dakota", "tennessee", "texas", "utah", "vermont", "virginia", 
"washington", "west virginia", "wisconsin", "wyoming"), class = "factor"), 
    date = structure(c(-633, NA, 131, -315, 5421, 3558), class = "Date"), 
    LON = c(-80.6495194, -112, -83.6129939, -83.6129939, -121.6169108, 
    -87.8328505), LAT = c(41.0997803, 33, 42.2411499, 42.2411499, 
    39.1404477, 42.4461322)), .Names = c("STATEid", "date", "LON", 
"LAT"), row.names = 23611:23616, class = "data.frame")
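
Putting the three conditions together (same state, within a given radius, within the next N days), one rough sketch for data shaped like the dput above is shown below. Everything in it is illustrative: the toy rows are invented, count_nearby_after and haversine_km are hypothetical names, and the distance uses a spherical-Earth approximation. It compares each row against all rows, so it is O(n^2) and may be slow for very large datasets.

```r
# Spherical great-circle distance in km (assumed mean radius 6371 km).
haversine_km <- function(lon1, lat1, lon2, lat2, radius_km = 6371) {
  to_rad <- pi / 180
  a <- sin((lat2 - lat1) * to_rad / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) *
    sin((lon2 - lon1) * to_rad / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# For each row, count events in the same state, within radius_km,
# dated strictly after the row's date and at most `days` later.
count_nearby_after <- function(data, days, radius_km) {
  vapply(seq_len(nrow(data)), function(i) {
    if (is.na(data$date[i])) return(NA_integer_)
    dist_km <- haversine_km(data$LON[i], data$LAT[i], data$LON, data$LAT)
    hit <- data$STATEid == data$STATEid[i] &
      data$date > data$date[i] & data$date <= data$date[i] + days &
      dist_km <= radius_km
    sum(hit, na.rm = TRUE)  # rows with missing dates never count as hits
  }, integer(1))
}

# Toy data (invented): three nearby Michigan events, one late one, one in Ohio.
toy <- data.frame(
  STATEid = c("michigan", "michigan", "michigan", "michigan", "ohio"),
  date    = as.Date(c("1969-02-20", "1969-03-10", "1969-03-15",
                      "1969-06-01", "1969-02-25")),
  LON     = c(-83.61299, -83.61299, -84.40, -83.61299, -83.56),
  LAT     = c(42.24115, 42.24115, 42.25, 42.24115, 41.65),
  stringsAsFactors = FALSE
)

toy$Post30_50km  <- count_nearby_after(toy, days = 30, radius_km = 50)
toy$Post30_100km <- count_nearby_after(toy, days = 30, radius_km = 100)
print(toy)
```

The strict "later than" comparison means an event never counts itself, matching the date > x condition used in the functions earlier in the post; it also means other events on the same day are not counted.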

Best,

Tom

(A quick update: problem solved - please see here and thanks to all community members: R - How to vectorize with apply family function and avoid while/for loops in this case?)

Tony Chang