
I have an Excel sheet set up like below and have loaded it as a df:

GPS_Lat     GPS_Lon     Location
50.70528    -120.44984  0
50.70528    -120.44984  0
50.70527    -120.44984  0
50.70527    -120.44984  0
50.70526    -120.44984  1
50.70526    -120.44984  1
50.70525    -120.44984  1
50.70525    -120.44984  0
50.70524    -120.44984  0
50.70524    -120.44984  0
50.70523    -120.44984  0
50.70523    -120.44984  0
50.70522    -120.44984  0
50.70522    -120.44984  0
50.70521    -120.44984  1
50.70521    -120.44984  1
50.7052     -120.44985  1
50.7052     -120.44985  1
50.70519    -120.44985  0
50.70519    -120.44986  0
50.70518    -120.44986  0
50.70518    -120.44986  0
50.70517    -120.44987  0
50.70517    -120.44987  0

I would like to keep the values within 1 m of the first "1" encountered when going down the Location column, and put those values in a new data frame (named df-n). If there are other sections with 1 values, I would like to split those into separate data frames (named df-n+1), again keeping only the points within 1 m of that section's first 1. I would like each new data frame to be numbered sequentially. I am completely stumped on this one.

In short: identify those points automatically and, for each one, create a new data frame containing the rows within 1 m (or any other defined distance) of it, named with the original data frame name plus a sequential suffix. So the original stays "df", then "df-1", "df-2", and so on.

I will eventually be using a radius of 10 km and my data set can be up to 20k rows.

EDIT: To clarify the expected output, here are hypothetical distances with a radius cutoff of 2 m.

   Row  GPS_Lat    GPS_Lon     Location hypothetical_dist_1 hypothetical_dist_2 
    1   50.70528    -120.44984  0   4   14
    2   50.70528    -120.44984  0   3   13
    3   50.70527    -120.44984  0   2   12
    4   50.70527    -120.44984  0   1   11
    5   50.70526    -120.44984  1   0   10
    6   50.70526    -120.44984  1   1   9
    7   50.70525    -120.44984  1   2   8
    8   50.70525    -120.44984  0   3   7
    9   50.70524    -120.44984  0   4   6
    10  50.70524    -120.44984  0   5   5
    11  50.70523    -120.44984  0   6   4
    12  50.70523    -120.44984  0   7   3
    13  50.70522    -120.44984  0   8   2
    14  50.70522    -120.44984  0   9   1
    15  50.70521    -120.44984  1   10  0
    16  50.70521    -120.44984  1   11  1
    17  50.7052     -120.44985  1   12  2
    18  50.7052     -120.44985  1   13  3
    19  50.70519    -120.44985  0   14  4
    20  50.70519    -120.44986  0   15  5
    21  50.70518    -120.44986  0   16  6
    22  50.70518    -120.44986  0   17  7
    23  50.70517    -120.44987  0   18  8
    24  50.70517    -120.44987  0   19  9

OUTPUT:

df-1
Row GPS_Lat     GPS_Lon    Location hypothetical_dist_1 
3   50.70527    -120.44984  0       2
4   50.70527    -120.44984  0       1
5   50.70526    -120.44984  1       0
6   50.70526    -120.44984  1       1
7   50.70525    -120.44984  1       2

And

df-2
Row GPS_Lat     GPS_Lon    Location hypothetical_dist_2
13  50.70522    -120.44984  0       2
14  50.70522    -120.44984  0       1
15  50.70521    -120.44984  1       0
16  50.70521    -120.44984  1       1
17  50.7052     -120.44985  1       2

Basically the workflow is as follows: batch-load multiple .csv files into a list, name the list items based on the file names, separate each list item using the chosen radius (into filename-1, filename-2, etc.), plot the values in the other columns against each other, and export all of the plots. A minimal sketch of the loading step I have in mind (the "data" folder path is hypothetical; adjust to wherever the .csv files live):

files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
# read each file and name the list items after the file names (minus extension)
dfs <- setNames(lapply(files, read.csv),
                tools::file_path_sans_ext(basename(files)))
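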

B. Mack
  • Please do not post an image of code/data/errors: it cannot be copied or searched (SEO), it breaks screen-readers, and it may not fit well on some mobile devices. Ref: https://meta.stackoverflow.com/a/285557 (and https://xkcd.com/2116/). Please just include the code, console output, or data (e.g., `dput(head(x))` or `data.frame(...)`) directly. – r2evans Aug 10 '20 at 17:11
  • I will change that! – B. Mack Aug 10 '20 at 17:31
  • Much better, thanks! (Nitnoid, for future edits: it helps if we can just copy/paste as a table, but embedded spaces make that much more difficult: either we have to edit the copied text manually or ... we just give up and don't try. The "gold standard" for R is either `data.frame(...)` (build it programmatically, if possible) or paste the output from `dput(x)`, where `x` is the "smallest required for the question". In this simple case, though, it is as easy as changing `GPS Lat` to `GPS_Lat`, no space, same for Lon.) – r2evans Aug 10 '20 at 18:09
  • Thank you for that info, I wasn't sure how to get the data in a usable format as my actual data frame has 25 more columns and about 15,000 rows. – B. Mack Aug 10 '20 at 18:27
  • (If it helps to understand, I can highlight from `GPS_Lat` through `-120.44987 0`, copy, then in R run `read.table(header=TRUE, 'clipboard')` to "get" your data. This does not work when there are embedded spaces or something else going on. Just an FYI, thanks!) – r2evans Aug 10 '20 at 18:29
  • @r2evans I am having trouble running your read.table approach on my Mac. Is there a similar approach that you are aware of for Mac users? Thanks! – Eric Aug 10 '20 at 19:00
  • Right, Macs use `"pbcopy"` instead; see https://stackoverflow.com/a/14547110/3358272 – r2evans Aug 10 '20 at 19:01

2 Answers


Up front, I typically recommend against storing into individual frames if you plan on processing the individual frames in similar manners. (See https://stackoverflow.com/a/24376207/3358272 for "list of frames" discussions.)

To demonstrate which rows are used/discarded, I'll add $row here. It isn't necessary for (or used by) any of my code; it's just for the demo.

Also, you mention measuring distance from the first "1", but there is no preceding "1" for the top 4 rows. Since the rest of my processing assumes that the first row is the meaningful one, I'll copy the first "1" (from row 5 in this case) to the top of the frame, so that subsequent grouping distance measurements work as expected.

I'll use dplyr here for easy grouping.

library(dplyr)
if (dat$Location[1] != 1) {
  prepended1 <- TRUE # in case we want to discard this copied row later
  # bring the first "1" to the top
  dat <- bind_rows(dat[which(dat$Location == 1)[1],,drop = FALSE], dat)
  dat$row[1] <- 0L
} else prepended1 <- FALSE

dat2 <- dat %>%
  mutate(grp = cumsum(c(TRUE, diff(Location) > 0))) %>%
  group_by(grp) %>%
  mutate(dist = geosphere::distVincentyEllipsoid(cbind(GPS_Lon, GPS_Lat), cbind(GPS_Lon[1], GPS_Lat[1]))) %>%
  ungroup()
dat2
# # A tibble: 25 x 6
#    GPS_Lat GPS_Lon Location   row   grp  dist
#      <dbl>   <dbl>    <int> <int> <int> <dbl>
#  1    50.7   -120.        1     0     1  0   
#  2    50.7   -120.        0     1     1  2.22
#  3    50.7   -120.        0     2     1  2.22
#  4    50.7   -120.        0     3     1  1.11
#  5    50.7   -120.        0     4     1  1.11
#  6    50.7   -120.        1     5     2  0   
#  7    50.7   -120.        1     6     2  0   
#  8    50.7   -120.        1     7     2  1.11
#  9    50.7   -120.        0     8     2  1.11
# 10    50.7   -120.        0     9     2  2.22
# # ... with 15 more rows

This provides the distance from the "first 1 in a group" (notice the grp variable). From here, it's easy enough to filter by dist and split by grp.

Depending on your follow-on flow, it might be preferable to remain in this single-frame format, using dplyr::group_by, though adapting code-unseen is a different adventure.
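For example, something like this (a sketch of the kind of grouped flow I mean; the summary columns are just placeholders, not part of your data):

dat2 %>%
  filter(dist <= 2) %>%
  group_by(grp) %>%
  summarize(n_points = n(), max_dist = max(dist))

This keeps everything in one frame keyed by grp instead of materializing separate objects.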

None of your data (except the first rows themselves) is within 1 m, so for demonstration I'll use 2 m.

dat2 %>%
  filter(dist <= 2)
# # A tibble: 11 x 6
#    GPS_Lat GPS_Lon Location   row   grp  dist
#      <dbl>   <dbl>    <int> <int> <int> <dbl>
#  1    50.7   -120.        1     0     1  0   
#  2    50.7   -120.        0     3     1  1.11
#  3    50.7   -120.        0     4     1  1.11
#  4    50.7   -120.        1     5     2  0   
#  5    50.7   -120.        1     6     2  0   
#  6    50.7   -120.        1     7     2  1.11
#  7    50.7   -120.        0     8     2  1.11
#  8    50.7   -120.        1    15     3  0   
#  9    50.7   -120.        1    16     3  0   
# 10    50.7   -120.        1    17     3  1.32
# 11    50.7   -120.        1    18     3  1.32

Again, instead of breaking out into individual variables, I'll preserve it as a list of frames.

lst_of_frames <- dat2 %>%
  filter(dist <= 2) %>%
  split(., .$grp)
lst_of_frames
# $`1`
# # A tibble: 3 x 6
#   GPS_Lat GPS_Lon Location   row   grp  dist
#     <dbl>   <dbl>    <int> <int> <int> <dbl>
# 1    50.7   -120.        1     0     1  0   
# 2    50.7   -120.        0     3     1  1.11
# 3    50.7   -120.        0     4     1  1.11
# $`2`
# # A tibble: 4 x 6
#   GPS_Lat GPS_Lon Location   row   grp  dist
#     <dbl>   <dbl>    <int> <int> <int> <dbl>
# 1    50.7   -120.        1     5     2  0   
# 2    50.7   -120.        1     6     2  0   
# 3    50.7   -120.        1     7     2  1.11
# 4    50.7   -120.        0     8     2  1.11
# $`3`
# # A tibble: 4 x 6
#   GPS_Lat GPS_Lon Location   row   grp  dist
#     <dbl>   <dbl>    <int> <int> <int> <dbl>
# 1    50.7   -120.        1    15     3  0   
# 2    50.7   -120.        1    16     3  0   
# 3    50.7   -120.        1    17     3  1.32
# 4    50.7   -120.        1    18     3  1.32

If you need to break it out, it's easy enough to assign these manually (e.g., lst_of_frames[[2]]).
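If you really want individual variables (again, discouraged), a quick sketch with assign/paste0; note that names containing hyphens must be accessed with backticks, e.g. `df-1`:

for (i in seq_along(lst_of_frames)) {
  # creates variables named "df-1", "df-2", ... in the calling environment
  assign(paste0("df-", i), lst_of_frames[[i]])
}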


Let's try a different approach:

Compare every point with every first-1 point. In this data, there are two first-1 points (rows 5 and 15), so we'll compare all 24 rows with those two points.

compare_points <- filter(dat, Location == 1 & lag(Location) == 0)
compare_points
#    GPS_Lat   GPS_Lon Location row
# 1 50.70526 -120.4498        1   5
# 2 50.70521 -120.4498        1  15

A comparison:

lapply(seq_len(nrow(compare_points)), function(ind) {
  dat %>%
    # distance from every row to this "first 1" point (note lon/lat column order for geosphere)
    mutate(dist = geosphere::distVincentyEllipsoid(compare_points[ind, 2:1], cbind(GPS_Lon, GPS_Lat))) %>%
    filter(dist <= 2)
})
# [[1]]
#    GPS_Lat   GPS_Lon Location row     dist
# 1 50.70527 -120.4498        0   3 1.112426
# 2 50.70527 -120.4498        0   4 1.112426
# 3 50.70526 -120.4498        1   5 0.000000
# 4 50.70526 -120.4498        1   6 0.000000
# 5 50.70525 -120.4498        1   7 1.112426
# 6 50.70525 -120.4498        0   8 1.112426
# [[2]]
#    GPS_Lat   GPS_Lon Location row     dist
# 1 50.70522 -120.4498        0  13 1.112426
# 2 50.70522 -120.4498        0  14 1.112426
# 3 50.70521 -120.4498        1  15 0.000000
# 4 50.70521 -120.4498        1  16 0.000000
# 5 50.70520 -120.4498        1  17 1.317768
# 6 50.70520 -120.4498        1  18 1.317768
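If you assign that lapply result to a variable, say results (a hypothetical name), the sequential naming requested in the comments below is one line:

names(results) <- paste0("dat-", seq_along(results))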
r2evans
  • I really don't know what you are expecting in the output, B.Mack. Perhaps you can be explicit with your expected results? Do you mean you want the distance between **all** points and the first "1" point, then between **all** points and the second "1" point, etc? – r2evans Aug 10 '20 at 20:56
  • What do you mean by *"other sections with 1 values"*? Perhaps I'm over-thinking this grouping. – r2evans Aug 10 '20 at 21:08
  • How does my edit look? The results are pretty close (I have one more row per frame). – r2evans Aug 10 '20 at 21:51
  • That looks like its getting really close! I just need to have them named based on the original with the suffix 1 and 2 etc. (so dat-1, dat-2 for this). Basically the workflow is as follows: batch load multiple .csv files into a list, name the list items based on the file names, separate each list item using the radius chosen (into filename-1, filename-2, etc.), plot values in the other columns against each other, and export all of the plots. – B. Mack Aug 10 '20 at 22:45
  • I'd think splitting a `list` and naming it is straight-forward enough. If you insist on breaking things into individual frames (discouraged!), then a `for` loop with `assign`, `paste0`, and perhaps `seq_along` or one of the `seq` variants will do what you need. – r2evans Aug 10 '20 at 23:00
  • I am having difficulties adapting the code to run through my list. Specifically I am getting the error: "Error in UseMethod("filter_") : no applicable method for 'filter_' applied to an object of class "list" In addition: Warning message: `filter_()` is deprecated as of dplyr 0.7.0. Please use `filter()` instead. See vignette('programming') for more help" – B. Mack Aug 11 '20 at 15:55
  • I tried compare_points <- lapply(list, function(x) filter(x, Location == 1 & lag(Location) == 0)) and that worked but now I cant get the second part to run? – B. Mack Aug 11 '20 at 16:05
  • What second part? – r2evans Aug 11 '20 at 16:09
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/219625/discussion-between-b-mack-and-r2evans). – B. Mack Aug 11 '20 at 17:17

You can get the distances between all points using base R's dist(...) function. 100 m is approximately 0.0009 degrees of latitude, so you can inspect the distance_matrix object to determine which points are < 100 m apart.

Note that this uses Euclidean distance, which is OK for short distances in lat/long, but you will run into issues if you use it for longer distances of more than a few miles.

I have included a reprex for anyone else who wants to try a solution - feel free to add it to your question.

# lat, lon, and Location values, flattened column-wise (24 rows x 3 columns)
latlong <- c(50.70528, 50.70528, 50.70527, 50.70527, 50.70526, 50.70526, 50.70525, 50.70525,
             50.70524, 50.70524, 50.70523, 50.70523, 50.70522, 50.70522, 50.70521, 50.70521,
             50.7052, 50.7052, 50.70519, 50.70519, 50.70518, 50.70518, 50.70517, 50.70517,
             -120.44984, -120.44984, -120.44984, -120.44984, -120.44984, -120.44984, -120.44984, -120.44984,
             -120.44984, -120.44984, -120.44984, -120.44984, -120.44984, -120.44984, -120.44984, -120.44984,
             -120.44985, -120.44985, -120.44985, -120.44986, -120.44986, -120.44986, -120.44987, -120.44987,
             0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
latlong <- matrix(latlong, nrow = 24)  # columns: lat, lon, Location

# pairwise Euclidean distances (in degrees) between all 24 points
distance_matrix <- as.matrix(dist(latlong[, c(1, 2)], method = "euclidean"))
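From there, a minimal sketch of the thresholding step (assuming the rough 100 m ≈ 0.0009-degree conversion above):

cutoff <- 0.0009  # roughly 100 m in degrees of latitude
# row/column indices of point pairs closer than the cutoff,
# excluding zero distances (self-pairs and exact duplicate points)
close_pairs <- which(distance_matrix < cutoff & distance_matrix > 0, arr.ind = TRUE)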
bstrain
  • Euclidean distance on lat/lon is never really okay, even for short distances. For example, adding `0.0001` in latitude versus longitude results in either 7 or 11 meters (give or take). If you want to use euclidean, one should really convert to UTM where euclidean math is actually reliable. – r2evans Aug 10 '20 at 19:00
  • `geosphere::distVincentyEllipsoid(rbind(dat[1,1:2], dat[1,1:2] + c(0.0001, 0))[,2:1])` and `geosphere::distVincentyEllipsoid(rbind(dat[1,1:2], dat[1,1:2] + c(0, 0.0001))[,2:1])`, for example. – r2evans Aug 10 '20 at 19:00
  • I guess I need to check my assumptions! I have used euclidean where small distances are negligible...but you are right it doesn't scale. Thank you for the example. – bstrain Aug 10 '20 at 19:06