In broad terms, I am trying to use apply() so that the processing of each row depends on the results of previously processed rows.

This post is related, but didn't help me build the results.

I want to build a dataframe of unique "locations" from a dataframe of incidents. The incidents are registered with geocoordinates (lon,lat). I've sorted the incidents by lon and lat, then go through them sequentially with apply(). As a result, I want to get something like expectedResult. I check if the geocoordinates of an incident are equal to the geocoordinates of one I've processed previously. If they aren't, I create a new location. If they are, I assume the incident took place at the same location.

My issue is that I don't know how to build the dataframe/list of locations when applying the function to incidents. Before applying the function checkEquals to incidents, I create an initial dataframe locations containing the first location.

In my sample data, row 3 is intentionally a duplicate of row 1, so at least these two incidents should be assigned to the same location.

checkEquals <- function(row,loc){
    prevLoc <- loc[nrow(loc),]
    if (as.numeric(row["lon"]) == as.numeric(prevLoc["lon"]) 
        && as.numeric(row["lat"]) == as.numeric(prevLoc["lat"]))  {
        # if (row == prevLoc) {
        prevLoc["count"] <- as.numeric(prevLoc["count"]) + 1
        loc[nrow(loc),] <- prevLoc
    } else {
        loc[nrow(loc)+1,] <- c(row["id"], row["lon"], row["lat"],count=1)
    }
    locations <<- loc
}

main <- function(){
    incidents <- data.frame(id = c(1,2,3,4), lon = c(-81, -80, -81, -79), lat = c(42, 40, 42, 41) )
    incidents <- incidents[order(incidents$lon, incidents$lat),]
    locations <- data.frame(id=1,lon=incidents[1,]$lon, lat=incidents[1,]$lat, count=0)

    locations <- apply(incidents,1,checkEquals,locations)
    print(locations)
    expectedResult <- data.frame(id = c(1,2,4), lon = c(-81, -80, -79), lat = c(42, 40, 41), count = c(2,1,1))
    print(expectedResult)
}


> main()
$`1`
  id lon lat count
1  1 -81  42     1

$`3`
  id lon lat count
1  1 -81  42     1

$`2`
  id lon lat count
1  1 -81  42     0
2  2 -80  40     1

$`4`
  id lon lat count
1  1 -81  42     0
2  4 -79  41     1

> expectedResult
  id lon lat count
1  1 -81  42     2
2  2 -80  40     1
3  4 -79  41     1

In each iteration of apply(), the function compares against the initial locations. I want locations to change with every iteration, either by adding rows or by modifying existing ones. Apparently the final assignment locations <<- loc doesn't do the trick, and neither does an explicit assign(). In addition, locations comes back as a list of dataframes rather than a single dataframe.
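For illustration, here is a minimal sketch of the same sequential logic written as an explicit loop, which makes it straightforward to carry state from one row to the next. As far as I can tell, apply() passes the same initial locations to every call, and `<<-` inside checkEquals writes to a global variable rather than to the local locations in main(). The helper name build_locations is hypothetical, and the sketch assumes at least one incident and exact equality of coordinates.

```r
# Hypothetical helper (not from the original post): builds the locations
# data frame by walking the sorted incidents one row at a time.
build_locations <- function(incidents) {
  incidents <- incidents[order(incidents$lon, incidents$lat), ]
  # The first incident defines the first location.
  locations <- data.frame(id = incidents$id[1], lon = incidents$lon[1],
                          lat = incidents$lat[1], count = 1)
  for (i in seq_len(nrow(incidents))[-1]) {
    last <- nrow(locations)
    if (incidents$lon[i] == locations$lon[last] &&
        incidents$lat[i] == locations$lat[last]) {
      # Same coordinates as the most recent location: increment its count.
      locations$count[last] <- locations$count[last] + 1
    } else {
      # New coordinates: append a new location row.
      locations[last + 1, ] <- list(incidents$id[i], incidents$lon[i],
                                    incidents$lat[i], 1)
    }
  }
  locations
}

incidents <- data.frame(id = c(1, 2, 3, 4),
                        lon = c(-81, -80, -81, -79),
                        lat = c(42, 40, 42, 41))
locations <- build_locations(incidents)
print(locations)
```

With the sample data this should give the same ids, coordinates and counts as expectedResult.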

Arto Pihlaja
  • Please read [how do I ask a good question](http://stackoverflow.com/help/how-to-ask), [How to create a MCVE](http://stackoverflow.com/help/mcve) as well as [how to provide a minimal reproducible example in R](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example#answer-5963610). I suggest you edit your question and provide minimal dummy input data, which abstracts from your specific problem, plus the expected output. – lukeA Apr 22 '16 at 21:10
  • The revised question is reproducible with its sample data and, I think, clear enough. The issue remains open. – Arto Pihlaja May 05 '16 at 04:40
  • `incidents[!duplicated(incidents[, 2:3]), ]` gives you `expectedResult`. – lukeA May 05 '16 at 16:43
  • Luke, you're right! A simple solution to a simple problem. – Arto Pihlaja May 05 '16 at 18:59
  • Unfortunately, I had simplified the problem a bit too much in this post. Firstly, I had forgotten column 'count' from `expectedResult`. The idea was to count the number of incidents at the same location. Secondly, in the real problem I'm trying to solve, I use a custom function to find coordinates _near_ each other. So, for instance (-81.000000, 42.000000) and (-81.000000, 42.000001) would go to the same location, but they are not duplicates. – Arto Pihlaja May 05 '16 at 19:09
  • #1 Check out `?aggregate`, #2 check out `?round`. – lukeA May 06 '16 at 07:22
  • As to the question about `apply()` in the thread header, no, I don't think it's possible to carry information over loops. In other words, at least I didn't find a way to make processing of row n depend on the results of processing row n-1 when using the `apply()` family. – Arto Pihlaja May 14 '16 at 06:33
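As a possible alternative to the `apply()` family for this kind of problem, base R's `Reduce()` passes an accumulator from one step to the next, so each row can see the result of the previous ones. The sketch below is only an illustration under assumptions: `is_near`, its tolerance `tol`, and `add_incident` are hypothetical names standing in for the custom nearness check mentioned above, and each incident is compared only with the most recently created location.

```r
# Hypothetical nearness check; the tolerance is an assumed value.
is_near <- function(lon1, lat1, lon2, lat2, tol = 1e-4) {
  abs(lon1 - lon2) <= tol && abs(lat1 - lat2) <= tol
}

# Fold step: takes the locations built so far and one row index,
# and returns the updated locations.
add_incident <- function(locations, i, incidents) {
  last <- nrow(locations)
  if (last > 0 && is_near(incidents$lon[i], incidents$lat[i],
                          locations$lon[last], locations$lat[last])) {
    locations$count[last] <- locations$count[last] + 1
    locations
  } else {
    rbind(locations,
          data.frame(id = incidents$id[i], lon = incidents$lon[i],
                     lat = incidents$lat[i], count = 1))
  }
}

incidents <- data.frame(id = c(1, 2, 3, 4),
                        lon = c(-81, -80, -81, -79),
                        lat = c(42, 40, 42.000001, 41))
incidents <- incidents[order(incidents$lon, incidents$lat), ]

# Start from an empty locations data frame and fold over the row indices.
empty <- data.frame(id = numeric(0), lon = numeric(0),
                    lat = numeric(0), count = numeric(0))
locations <- Reduce(function(acc, i) add_incident(acc, i, incidents),
                    seq_len(nrow(incidents)), empty)
print(locations)
```

Because only the last location is checked, this relies on the incidents being sorted so that nearby points end up adjacent, which is not guaranteed for points that are close in lat but differ slightly in lon.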

1 Answer

You could do

df <- data.frame(id = c(1,2,3,4), 
                 lon = c(-81.0000, -80, -81.0001, -79), 
                 lat = c(42, 40, 42, 41) )
library(dplyr)
df %>% 
  group_by(lon=round(lon, 3), lat=round(lat, 3)) %>% 
  summarise(count=n())
# Source: local data frame [3 x 3]
# Groups: lon [?]
# 
#     lon   lat count
#   (dbl) (dbl) (int)
# 1   -81    42     2
# 2   -80    40     1
# 3   -79    41     1
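If you would rather avoid the dplyr dependency, roughly the same grouping can be sketched in base R with `round()` and `aggregate()`. This is only an illustrative variant of the approach above; the column is renamed to `count` by hand, and the row order of the result may differ from the dplyr output.

```r
df <- data.frame(id = c(1, 2, 3, 4),
                 lon = c(-81.0000, -80, -81.0001, -79),
                 lat = c(42, 40, 42, 41))

# Round the coordinates so that nearby points fall into the same group.
df$lon <- round(df$lon, 3)
df$lat <- round(df$lat, 3)

# Count the incidents per (lon, lat) pair; length() of id gives the group size.
counts <- aggregate(id ~ lon + lat, data = df, FUN = length)
names(counts)[names(counts) == "id"] <- "count"
print(counts)
```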
lukeA