Optimize a subset operation of a nested list

Question

Is it possible to improve the speed of the last subset operation in this code? The code fetches a small portion of OpenStreetMap data, finds all the roads that have names, and creates a new OSM object that contains only those roads. I'm interested in optimizing the last bit of the code:

highway_subset <- subset(muc, ids = highway_subset_ids)

class(muc)

[1] "osmar" "list"

muc is a list of lists, and each element of the list has an id that is used to create a subset.

Here is the complete example:

library("osmar")
src <- osmsource_api(url = "https://api.openstreetmap.org/api/0.6/")
muc_bbox <- center_bbox(11.575278, 48.137222, 1000, 1000)
muc <- get_osm(muc_bbox, src)

highway_subset_ids <- subset(muc, way_ids = find(muc, way(tags(k == "highway"))))
highway_subset_ids <- find(highway_subset_ids, way(tags(k == "name")))
highway_subset_ids <- find_down(muc, way(highway_subset_ids))
highway_subset <- subset(muc, ids = highway_subset_ids)

Thank you very much in advance.

UPDATE

If you have trouble with SSL, try copying and pasting the following code example. It is as minimal as I could make it.

The line I would like to optimize is this one:

final_subset <- subset(highway_subset, ids = highway_subset_ids)

library("osmar")

highway_subset <-
  structure(list(nodes = structure(list(
          attrs = structure(
            list(
              id = numeric(0),
              visible = character(0),
              timestamp = structure(
                list(
                  sec = numeric(0),
                  min = integer(0),
                  hour = integer(0),
                  mday = integer(0),
                  mon = integer(0),
                  year = integer(0),
                  wday = integer(0),
                  yday = integer(0),
                  isdst = integer(0),
                  zone = character(0),
                  gmtoff = integer(0)
                ),
                class = c("POSIXlt", "POSIXt")
              ),
              version = numeric(0),
              changeset = numeric(0),
              user = structure(integer(0), .Label = character(0), class = "factor"),
              uid = structure(
                integer(0),
                .Label = c("2455020", "2590140", "367380"),
                class = "factor"
              ),
              lat = numeric(0),
              lon = numeric(0)
            ),
            row.names = integer(0),
            class = "data.frame"
          ),
          tags = structure(
            list(
              id = numeric(0),
              k = structure(integer(0), .Label = character(0), class = "factor"),
              v = structure(integer(0), .Label = character(0), class = "factor")
            ),
            row.names = integer(0),
            class = "data.frame"
          )
        ),
        class = c("nodes", "osmar_element", "list")
      ),
      ways = structure(
        list(
          attrs = structure(
            list(
              id = c(105071009, 366457476),
              visible = c("true", "true"),
              timestamp = structure(
                list(
                  sec = c(10, 48),
                  min = c(54L, 15L),
                  hour = c(13L, 20L),
                  mday = c(4L, 15L),
                  mon = c(2L, 4L),
                  year = 117:116,
                  wday = c(6L, 0L),
                  yday = c(62L, 135L),
                  isdst = 0:1,
                  zone = c("CET", "CEST"),
                  gmtoff = c(NA_integer_, NA_integer_)
                ),
                class = c("POSIXlt", "POSIXt")
              ),
              version = c(15, 5),
              changeset = c(46573027, 39338422),
              user = structure(
                2:1,
                .Label = c("bjoern262", "saerdnaer"),
                class = "factor"
              ),
              uid = structure(
                4:3,
                .Label = c("367380",
                           "64536", "651621", "6998"),
                class = "factor"
              )
            ),
            row.names = c(2L,
                          4L),
            class = "data.frame"
          ),
          tags = structure(
            list(
              id = c(
                105071009,
                105071009,
                105071009,
                105071009,
                105071009,
                105071009,
                105071009,
                105071009,
                105071009,
                105071009,
                105071009,
                366457476,
                366457476,
                366457476,
                366457476,
                366457476
              ),
              k = structure(
                c(1L, 2L, 3L,
                  4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 3L, 5L, 6L, 7L, 11L),
                .Label = c(
                  "conveying",
                  "description",
                  "highway",
                  "incline",
                  "indoor",
                  "layer",
                  "level",
                  "oneway",
                  "operator",
                  "ref",
                  "tunnel"
                ),
                class = "factor"
              ),
              v = structure(
                c(6L,
                  9L, 10L, 4L, 11L, 3L, 2L, 11L, 8L, 7L, 11L, 5L, 11L, 1L, 1L,
                  11L),
                .Label = c(
                  "-3",
                  "-3;-4",
                  "-4",
                  "down",
                  "footway",
                  "forward",
                  "MP19",
                  "MVG",
                  "Rolltreppe MP19",
                  "steps",
                  "yes"
                ),
                class = "factor"
              )
            ),
            row.names = 4:19,
            class = "data.frame"
          ),
          refs = structure(
            list(
              id = c(105071009, 105071009, 366457476,
                     366457476, 366457476),
              ref = c(3270556979, 1211172719, 3270556979,
                      3704371485, 3704371444)
            ),
            row.names = c(20L, 21L, 68L, 69L,
                          70L),
            class = "data.frame"
          )
        ),
        class = c("ways", "osmar_element",
                  "list")
      ),
      relations = structure(
        list(
          attrs = structure(
            list(
              id = numeric(0),
              visible = character(0),
              timestamp = structure(
                list(
                  sec = numeric(0),
                  min = integer(0),
                  hour = integer(0),
                  mday = integer(0),
                  mon = integer(0),
                  year = integer(0),
                  wday = integer(0),
                  yday = integer(0),
                  isdst = integer(0),
                  zone = character(0),
                  gmtoff = integer(0)
                ),
                class = c("POSIXlt", "POSIXt")
              ),
              version = numeric(0),
              changeset = numeric(0),
              user = structure(integer(0), .Label = character(0), class = "factor"),
              uid = structure(
                integer(0),
                .Label = c(
                  "137242",
                  "161619",
                  "2455020",
                  "2590140",
                  "531886",
                  "72235",
                  "8748",
                  "9451067"
                ),
                class = "factor"
              )
            ),
            row.names = integer(0),
            class = "data.frame"
          ),
          tags = structure(
            list(
              id = numeric(0),
              k = structure(integer(0), .Label = character(0), class = "factor"),
              v = structure(integer(0), .Label = character(0), class = "factor")
            ),
            row.names = integer(0),
            class = "data.frame"
          ),
          refs = structure(
            list(
              id = numeric(0),
              type = structure(integer(0), .Label = character(0), class = "factor"),
              ref = numeric(0),
              role = structure(integer(0), .Label = character(0), class = "factor")
            ),
            row.names = integer(0),
            class = "data.frame"
          )
        ),
        class = c("relations",
                  "osmar_element", "list")
      )
    ),
    class = c("osmar", "list")
  )
highway_subset_ids <- find_down(highway_subset, way(highway_subset$ways$attrs$id))
final_subset <- subset(highway_subset, ids = highway_subset_ids)

Thank you!

`get_osm(muc_bbox, src)` fails with message "error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version" – Uwe, Aug 21 '20 at 06:06
Uwe, I updated the question and added a version where you don't need a connection. It is minimal though. – Andreas, Aug 21 '20 at 12:50
Thank you for updating the question, but I need more guidance / context / background / motivation to go in the right direction. What is the objective of your optimization? Speed? Memory consumption? Why do you think an optimization is required, etc.? Thank you. – Uwe, Aug 21 '20 at 13:20
If speed is an issue, have you tried to profile your production code? – Uwe, Aug 21 '20 at 13:21
It's about speed. I did profile my code, and this part is currently a problem. I have to make subsets of a large data set several times, and those times add up. Memory is not the issue. – Andreas, Aug 21 '20 at 13:28
[This](https://stackoverflow.com/a/50433477/9841389) might be of interest regarding the SSL issues. – ismirsehregal, Aug 25 '20 at 14:41

score 2 · Accepted Answer · answered Aug 25 '20 at 16:57

I profiled your code

library("osmar")
src <- osmsource_api(url = "https://api.openstreetmap.org/api/0.6/")
muc_bbox <- center_bbox(11.575278, 48.137222, 1000, 1000)
muc <- get_osm(muc_bbox, src)

system.time(
  highway_subset_ids <- subset(muc, way_ids = find(muc, way(tags(k == "highway"))))
)
# 0.157
system.time(
  highway_subset_ids <- find(highway_subset_ids, way(tags(k == "name")))
)
# 0.001
system.time(
  highway_subset_ids <- find_down(muc, way(highway_subset_ids))
)
# 0.008
system.time(
  highway_subset <- subset(muc, ids = highway_subset_ids)
)
# 0.025

As you can see, for me the last subset is not the bottleneck; the first one is (about 6 times more expensive).
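A single system.time call on an operation that takes a few milliseconds is close to timer resolution, so when a step is run repeatedly it can help to time many repetitions and divide. The following is only a sketch: `tags` and `ids` are hypothetical stand-ins built from random numbers, not the real osmar data.

```r
# Hypothetical stand-in for one of the data.frames inside the osmar object
set.seed(1)
tags <- data.frame(id = sample(1e4, 5e4, replace = TRUE), k = "highway")
ids  <- sample(1e4, 500)

# Repeat the subset so the total time rises well above timer resolution
n_rep <- 200
timing <- system.time(
  for (i in seq_len(n_rep)) res <- tags[tags$id %in% ids, ]
)

# Average elapsed seconds per subset call
per_call <- timing[["elapsed"]] / n_rep
```

The same pattern can be wrapped around the real `subset(muc, ids = ...)` call to get a steadier estimate of its per-call cost.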

The internal data are not really big

nodes 15157 rows
ways 2938 rows
tags 11966 rows
relations 350 rows
another tags 3270 rows

You mentioned that you need to subset multiple times. The thing to address might be to "vectorize" your code. I don't mean the obvious lapply, but extracting the internal data.frames, rbinding them, doing the subset only once, and splitting them again if needed. data.table can be employed here for extra speed. This will be much more beneficial than using a data.table subset in a loop on only 15,000 rows, where the benefit will be much smaller.

To understand how to "vectorize" that code, you need to understand how osmar's subset works. That is not difficult if you look at the source code: https://github.com/cran/osmar/blob/master/R/osmar-subsetting.R

try to take out data.frames from all objects to subset
rbindlist them
subset them using [.data.table
split if needed
turn into original objects if needed
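The steps above can be sketched roughly as follows. This is a minimal illustration, not osmar code: `tags_a` and `tags_b` are hypothetical stand-ins for the `tags` data.frames taken out of two objects, and `wanted_ids` is an invented id vector. Only data.table functions are used (`rbindlist`, `[`, and `split` with `by`).

```r
library(data.table)

# Hypothetical stand-ins for the `tags` data.frames of two objects
tags_a <- data.frame(id = c(1, 1, 2),
                     k  = c("highway", "name", "highway"),
                     v  = c("footway", "A", "steps"),
                     stringsAsFactors = FALSE)
tags_b <- data.frame(id = c(3, 4),
                     k  = c("highway", "highway"),
                     v  = c("steps", "path"),
                     stringsAsFactors = FALSE)

# Steps 1 and 2: stack everything into one table and record the origin of each row
all_tags <- rbindlist(list(a = tags_a, b = tags_b), idcol = "source")

# Step 3: one vectorized subset instead of one subset() call per object
wanted_ids <- c(1, 3)
kept <- all_tags[id %in% wanted_ids]

# Step 4: split back per source if the pieces are needed separately
kept_by_source <- split(kept, by = "source")
```

Turning the pieces back into osmar objects (the last step) would depend on the class structure shown by `str()` and is left out of this sketch.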

Also note that the osmar package is quite old (dated 2013) and has indirect dependencies on packages like sp, which is quite actively developed. You may run into issues from breaking changes introduced in osmar's dependencies over the last 7 years.

answered Aug 25 '20 at 16:57

jangorecki


Thank you very much for the suggestions. What do you mean by "subset them using [.data.table"? Could you please provide a small example? – Andreas Aug 25 '20 at 17:15
Would it be possible for you to provide an example of vectorizing the code of the function subset_ways from the osmar code? Thank you in advance! – Andreas Aug 25 '20 at 17:20
Regarding the first subset: it is not a problem in my code, since it is an operation I do only once, but the last subset is done repeatedly. – Andreas Aug 25 '20 at 18:09
Please extend your example code by providing the loop that does the last subset. – jangorecki Aug 26 '20 at 05:31
Reproducible if possible – jangorecki Aug 26 '20 at 05:47
I'm sorry, but that loop sadly doesn't add any new information. Furthermore, it's difficult to provide a data set of an OSM object, since even for a couple of nodes it creates a huge printout, so I can't put it here. If you don't have an SSL issue, the example in the original question is complete and reproducible. – Andreas Aug 26 '20 at 05:56
I don't have an SSL problem; your code runs fine for me. I just don't want to be guessing what precisely you are doing before investing my time in it. – jangorecki Aug 26 '20 at 17:43
OK. Thank you very much for your effort and suggestions so far. You don't have to :) As I mentioned earlier, the problem is that last subset. The whole loop does not give any additional information. – Andreas Aug 26 '20 at 18:22
Unfortunately I don't have any ideas for speeding up a single `subset`; even if it is possible, the speed-up is likely to be marginal. Big gains are possible if you get rid of the loop and do a _batch `subset`_. – jangorecki Aug 27 '20 at 19:02
Hmm, yes, I was thinking about it. The problem is that I can't know beforehand how many steps the loop will take. It's always different. – Andreas Aug 27 '20 at 19:51

score 0 · Answer 2 · answered Aug 22 '20 at 11:08

Yes, it probably is possible. You can see the structure of osmar objects by entering str(muc) in the console, and you can see the code used to do the subsetting by running osmar:::subset.osmar, which is made up of component functions like osmar:::subset_ways. All of it seems to be written in base R and could be sped up with, for instance, data.table.

The strategy might be to work out a more efficient way to do this whole set of operations in one go:

highway_subset_ids <- subset(muc, way_ids = find(muc, way(tags(k == "highway"))))
highway_subset_ids <- find(highway_subset_ids, way(tags(k == "name")))
highway_subset_ids <- find_down(muc, way(highway_subset_ids))
highway_subset <- subset(muc, ids = highway_subset_ids)

Exactly where you focus and how you do it depends on the details of the rest of your project and what you are actually trying to do.
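As a rough illustration of doing the whole selection in one pass, here is a sketch that works directly on plain data.frames shaped like the `attrs`, `tags` and `refs` components visible in the question's dput output. The `ways` list and all its values are hypothetical, no osmar functions are called, and this is not how osmar itself implements `find` or `subset`.

```r
# Hypothetical miniature of the three `ways` data.frames inside an osmar object
ways <- list(
  attrs = data.frame(id = c(10, 20, 30)),
  tags  = data.frame(id = c(10, 10, 20, 30),
                     k  = c("highway", "name", "highway", "building"),
                     v  = c("residential", "Main St", "footway", "yes"),
                     stringsAsFactors = FALSE),
  refs  = data.frame(id = c(10, 10, 20, 30), ref = c(1, 2, 2, 3))
)

# Ways that carry a "highway" tag and also a "name" tag
highway_ids <- ways$tags$id[ways$tags$k == "highway"]
named_ids   <- ways$tags$id[ways$tags$k == "name"]
keep_ids    <- intersect(highway_ids, named_ids)

# Filter every component once with the same id vector
named_highways <- lapply(ways, function(df) df[df$id %in% keep_ids, , drop = FALSE])

# Node ids referenced by the kept ways
node_ids <- unique(named_highways$refs$ref)
```

In this sketch `node_ids` plays the role of the node ids that the question collects with find_down; a real version would still have to filter the nodes component and restore the osmar classes.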
