I've done my best to read up on this, and I think I've found the process that fits best, but if anyone has other ideas, functions, or methods for this, they would be much appreciated. I have a list of small data frames with different numbers of rows; each data frame contains several latitude and longitude coordinates in separate columns. For each item in the list separately, I need to remove any coordinate pair that may be an outlier and then find the mean center of the remaining coordinates (so there should be one coordinate pair for each item in the list in the end).

The approach I've read about is to find the mean center by averaging the lats and the longs separately, calculate the Euclidean distance from that mean center to each coordinate pair, and remove any point that is farther away than a chosen distance (let's say 100 m). The final outcome is then the mean center of the remaining points. This seems a bit convoluted to me though, so again, if anyone has suggestions for a better way to remove coordinate outliers, I'd like to hear them.

Here's some code that I have so far:

dfList <- structure(list(`43` = structure(list(date = c("43 2011-04-06", "43 2011-04-07", "43 2011-04-08"), identifier = c(43, 43, 43), lon = c(-117.23041303, -117.23040817, -117.23039471), lat = c(32.81217294, 32.81218158, 32.81218645)), .Names = c("date", "identifier", "lon", "lat"), row.names = 13:15, class = "data.frame"), `44` = structure(list(date = c("44 2011-04-06", "44 2011-04-07", "44 2011-04-08"), identifier = c(44, 44, 44), lon = c(-117.22864227, -117.22861559, -117.22862265), lat = c(32.81257756, 32.81257089, 32.81257197)), .Names = c("date", "identifier", "lon", "lat"), row.names = 19:21, class = "data.frame"), `46` = structure(list(date = c("46 2011-04-06", "46 2011-04-07", "46 2011-04-08", "46 2011-04-09", "46 2011-04-10", "46 2011-04-11"), identifier = c(46, 46, 46, 46, 46, 46), lon = c(-117.22992617, -117.2289396895, -117.22965116, -117.23003928, -117.229922602, -117.22969664), lat = c(32.81295118, 32.8128226975, 32.81317299, 32.81224457, 32.813018734, 32.81276993)), .Names = c("date", "identifier", "lon", "lat"), row.names = 25:30, class = "data.frame"), `47` = structure(list(date = c("47 2011-04-06", "47 2011-04-07"), identifier = c(47, 47), lon = c(-117.2274484, -117.22747116), lat = c(32.81205838, 32.81207607)), .Names = c("date", "identifier", "lon", "lat"), row.names = 31:32, class = "data.frame")), .Names = c("43", "44", "46", "47"))

lonMean <- lapply(dfList, function(x) mean(x$lon)) # mean longitude per list item
latMean <- lapply(dfList, function(x) mean(x$lat)) # mean latitude per list item
latLon <- mapply(c, lonMean, latMean, SIMPLIFY = FALSE) # combine into one list of c(lon, lat) pairs

EDIT: What I need now is to calculate the distance between every coordinate pair in each item of the first list and the matching mean coordinate in the second list, and then remove any points from the first list whose distance is greater than 100 m. I've used dist and geodist (from the 'gmt' package) before, but I'm not sure how to use them with these two lists. Thanks so much for your help in advance. I'm not the most R-savvy person, so any help is much appreciated!

Misc
  • So what is the problem? – Scott Solmer Jun 26 '14 at 19:55
  • The problem is that I'm not sure how to create the distance matrix from the two lists that I have, and then how to drop outliers (but coordinate pairs of outliers, not just individual coordinates). – Misc Jun 26 '14 at 20:02
  • Okay, it wasn't clear. Try to be as clear and concise as possible. If you haven't yet, please read ["How do I ask a good question?"](http://stackoverflow.com/help/how-to-ask) – Scott Solmer Jun 26 '14 at 20:27
  • Thanks @Okuma.Scott, I've edited my question so that it has a more specific question. – Misc Jun 26 '14 at 20:32
  • Defining outliers is a tricky business... From what you say it looks like there is always one outlier. Could there be more than one? Or none at all? You may want to read through the [questions tagged 'outliers'](http://stats.stackexchange.com/questions/tagged/outliers) on CV. – nico Jun 26 '14 at 20:49
  • Well, for my purposes, I'd like to define an outlier as a point that is >100m from the mean center of the group of points. So maybe instead of saying it's an outlier thinking of it as a distance threshold? – Misc Jun 26 '14 at 20:56

1 Answer

Try this.

df <- do.call("rbind", dfList) # Flattens list into data frame, preserving 
                               # group identifier

# This function calculates the great-circle (haversine) distance in
# kilometers between two points given in decimal degrees
earth.dist <- function(long1, lat1, long2, lat2)
{
  rad <- pi/180            # degrees to radians
  a1 <- lat1 * rad
  a2 <- long1 * rad
  b1 <- lat2 * rad
  b2 <- long2 * rad
  dlon <- b2 - a2
  dlat <- b1 - a1
  a <- (sin(dlat/2))^2 + cos(a1) * cos(b1) * (sin(dlon/2))^2
  c <- 2 * atan2(sqrt(a), sqrt(1 - a))
  R <- 6378.145            # Earth's equatorial radius in km
  d <- R * c
  return(d)
}

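# Distance (km) from each point to the overall mean center of all rows
df$dist <- earth.dist(df$lon, df$lat, mean(df$lon), mean(df$lat))

df[df$dist >= 0.1,] # Rows 100 m (0.1 km) or more from the mean center; use df$dist < 0.1 to keep the rest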
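
If the mean center is wanted per group rather than over all rows, the same idea can be applied to each list item separately. The following is only a minimal sketch under some assumptions: `haversine_km`, `mean_center`, `groups`, and the `max_km` argument are hypothetical names introduced here, and `groups` is made-up data standing in for `dfList`. The distance formula is the same as in `earth.dist`, repeated so the block runs on its own.

```r
# Haversine distance in kilometers (same formula as earth.dist,
# redefined here so the sketch is self-contained).
haversine_km <- function(lon1, lat1, lon2, lat2, R = 6378.145) {
  rad <- pi / 180
  dlon <- (lon2 - lon1) * rad
  dlat <- (lat2 - lat1) * rad
  a <- sin(dlat / 2)^2 + cos(lat1 * rad) * cos(lat2 * rad) * sin(dlon / 2)^2
  2 * R * atan2(sqrt(a), sqrt(1 - a))
}

# Hypothetical stand-in for dfList: group "b" contains one far-off point.
groups <- list(
  a = data.frame(lon = c(-117.23041, -117.23040, -117.23039),
                 lat = c(32.81217, 32.81218, 32.81219)),
  b = data.frame(lon = c(-117.22993, -117.22992, -117.22990, -117.22991, -117.22700),
                 lat = c(32.81295, 32.81296, 32.81294, 32.81295, 32.81295))
)

# For one group: measure each point's distance to the group's own mean
# center, keep points closer than max_km, and average what is left.
mean_center <- function(g, max_km = 0.1) {
  d <- haversine_km(g$lon, g$lat, mean(g$lon), mean(g$lat))
  kept <- g[d < max_km, , drop = FALSE]
  data.frame(lon = mean(kept$lon), lat = mean(kept$lat), n_kept = nrow(kept))
}

# One row (one coordinate pair) per group.
centers <- do.call(rbind, lapply(groups, mean_center))
print(centers)
```

With the real data, `lapply(dfList, mean_center)` would be the equivalent call. Two caveats: the first mean center is itself pulled toward any outlier, so the threshold may need tuning, and a group in which no point lies within `max_km` yields `NaN` for its center.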
kng229
  • This seems to be exactly what I want. Although I didn't think there would be so many points within a group over 100m that I'd be left with no points for certain group identifiers. Maybe I'll mess with my distance threshold. Thanks! – Misc Jun 26 '14 at 21:49
  • You're welcome! I wasn't sure if you wanted the group means or the overall means - in my example, I used the overall mean for lat and long. That may affect your results as well. – kng229 Jun 26 '14 at 21:55
  • Oh wait, I did want the mean for each group, not the overall mean. I think I should be able to adapt your code for that though. – Misc Jun 26 '14 at 21:58