R tibble with list of indexes: how to quickly use them?

Question

I'm looking for a fast way to get the sum of a column in a table based on list of indexes in another table.

Here's a reproducible simple example: First create an edge table

fake_edges <- st_sf(data.frame(id=c('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'),
                               weight=c(102.1,98.3,201.0,152.3,176.4,108.6,151.4,186.3,191.2), 
                               soc=c(-0.1,0.7,1.1,0.2,0.5,-0.2,0.4,0.3,0.8), 
                               geometry=st_sfc(st_linestring(rbind(c(1,1), c(1,2))),
                                               st_linestring(rbind(c(1,2), c(2,2))),
                                               st_linestring(rbind(c(2,2), c(2,3))),
                                               st_linestring(rbind(c(1,1), c(2,1))),
                                               st_linestring(rbind(c(2,1), c(2,2))),
                                               st_linestring(rbind(c(2,2), c(3,2))),
                                               st_linestring(rbind(c(1,1), c(1,0))),
                                               st_linestring(rbind(c(1,0), c(0,0))),
                                               st_linestring(rbind(c(0,0), c(0,1)))
                                              )))

tm_shape(fake_edges, ext = 1.3) +
 tm_lines(lwd = 2) +
tm_shape(st_cast(fake_edges, "POINT")) +
  tm_dots(size = 0.3) +
tm_graticules(lines = FALSE)

Then create a network out of the table, and find the least expensive paths from first node to all nodes.

fake_net <- as_sfnetwork(fake_edges)

fake_paths <- st_network_paths(fake_net,
                         from=V(fake_net)[1],
                         to=V(fake_net),
                         weights='weight', type='shortest')

Now, what I'm trying to improve is the process of finding for each row of that fake_paths table

The id of the last edge in the path
The sum of soc for all the edges of the path

What I did was the following (it's quick here with the 9 lines, but takes a long while on a large network):

# Transforming to data.tables makes things a bit faster
fake_p <- as.data.table(fake_paths)
fake_e <- as.data.table(fake_edges)
# ID of the last edge on the path
fake_p$id <- apply(fake_p, 1, function(df) unlist(fake_e[df$edge_paths %>% last(), 'id'], use.names=F))
# Sum of soc
fake_p$result <- to_vec(for (edge in 1:nrow(fake_p)) fake_e[unlist(fake_p[edge, 'edge_paths']), soc] %>% sum())

Ultimately, what I want is that sum of soc that I call result to be joined backed with the original fake_edges

fake_e = left_join(fake_e, 
                   fake_p %>% select(id, result) %>% drop_na(id) %>% mutate(id=as.character(id), result=as.numeric(result)),
                   by='id')
fake_edges$result <- fake_e$result
fake_edges

Simple feature collection with 9 features and 4 fields
Geometry type: LINESTRING
Dimension:     XY
Bounding box:  xmin: 0 ymin: 0 xmax: 3 ymax: 3
CRS:           NA

id	weight	soc	geometry	result
a	102.1	-0.1	LINESTRING (1 1, 1 2)	-0.1
b	98.3	0.7	LINESTRING (1 2, 2 2)	0.6
c	201.0	1.1	LINESTRING (2 2, 2 3)	1.7
d	152.3	0.2	LINESTRING (1 1, 2 1)	0.2
e	176.4	0.5	LINESTRING (2 1, 2 2)	NA
f	108.6	-0.2	LINESTRING (2 2, 3 2)	0.4
g	151.4	0.4	LINESTRING (1 1, 1 0)	0.4
h	186.3	0.3	LINESTRING (1 0, 0 0)	0.7
i	191.2	0.8	LINESTRING (0 0, 0 1)	1.5

Could you provide some toy example of both node_paths and edge_paths using `dput`? The `collapse` package or `data.table` often outperform `dplyr` and `base R` when it comes to such use cases, but a complete rewrite might be overkill here. The loop seems to do some redundant actions, for example `unlist(use.names = F)` or using `map_dbl` or `summarise` directly would be preferable. — Donald Seinen, Nov 12 '21 at 14:42
You're right @Donald-seinen, using data.table does indeed speed things up quite a lot. Still a bit slow, but thanks for that tip! — Laurent, Nov 12 '21 at 15:58
Hi! I'm sorry but the question is not that clear. Could you create a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) using toy data or built in data (e.g. roxel data)? — agila, Nov 17 '21 at 11:29
Hi @agila, I'm going to prepare this. Funny, I was just reading [something you wrote](https://github.com/luukvdmeer/sfnetworks/pull/156) when I got your comment! — Laurent, Nov 17 '21 at 12:59
@agila, I hope it's easier to understand with the edit I just made to the question — Laurent, Nov 17 '21 at 14:31

score 2 · Accepted Answer · answered Nov 17 '21 at 17:09

I'm not sure what you are trying to accomplish, but the following procedure should correspond to the process that you describe in the first post.

Load packages

suppressPackageStartupMessages({
  library(sf)
  library(igraph)
  library(tidygraph)
  library(sfnetworks)
  library(tibble)
})

Define fake data

fake_edges <- st_sf(
  data.frame(
    id = c('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'),
    weight = c(102.1, 98.3, 201.0, 152.3, 176.4, 108.6, 151.4, 186.3, 191.2), 
    soc = c(-0.1, 0.7, 1.1, 0.2, 0.5, -0.2, 0.4, 0.3, 0.8), 
    geometry = st_sfc(
      st_linestring(rbind(c(1,1), c(1,2))), 
      st_linestring(rbind(c(1,2), c(2,2))), 
      st_linestring(rbind(c(2,2), c(2,3))), 
      st_linestring(rbind(c(1,1), c(2,1))), 
      st_linestring(rbind(c(2,1), c(2,2))), 
      st_linestring(rbind(c(2,2), c(3,2))), 
      st_linestring(rbind(c(1,1), c(1,0))), 
      st_linestring(rbind(c(1,0), c(0,0))), 
      st_linestring(rbind(c(0,0), c(0,1)))
    )
  )
)

Create a network out of the table, and find the shortest path from first node to all other nodes

fake_net <- as_sfnetwork(fake_edges)
fake_paths <- st_network_paths(
  x = fake_net, 
  from = V(fake_net)[1], 
  to = V(fake_net),
  weights = 'weight', 
  type = 'shortest'
)

Extract the id of the last edge in the path

idx_numeric <- unlist(lapply(fake_paths[["edge_paths"]], tail, n = 1L))
id <- fake_edges[["id"]][idx_numeric]

For each path, compute the sum of soc for all the edges of the path

result <- tapply(
  X = fake_edges[["soc"]][unlist(fake_paths[["edge_paths"]])], 
  INDEX = rep(seq_len(nrow(fake_paths)), times = lengths(fake_paths[["edge_paths"]])), 
  FUN = sum
)

Create a tibble object with columns id and result

my_tbl <- tibble(
  id = id, 
  result = result
)

Run the left join

left_join(fake_edges, my_tbl)
#> Joining, by = "id"
#> Simple feature collection with 9 features and 4 fields
#> Geometry type: LINESTRING
#> Dimension:     XY
#> Bounding box:  xmin: 0 ymin: 0 xmax: 3 ymax: 3
#> CRS:           NA
#>   id weight  soc result              geometry
#> 1  a  102.1 -0.1   -0.1 LINESTRING (1 1, 1 2)
#> 2  b   98.3  0.7    0.6 LINESTRING (1 2, 2 2)
#> 3  c  201.0  1.1    1.7 LINESTRING (2 2, 2 3)
#> 4  d  152.3  0.2    0.2 LINESTRING (1 1, 2 1)
#> 5  e  176.4  0.5     NA LINESTRING (2 1, 2 2)
#> 6  f  108.6 -0.2    0.4 LINESTRING (2 2, 3 2)
#> 7  g  151.4  0.4    0.4 LINESTRING (1 1, 1 0)
#> 8  h  186.3  0.3    0.7 LINESTRING (1 0, 0 0)
#> 9  i  191.2  0.8    1.5 LINESTRING (0 0, 0 1)

I really don't understand the ideas behind the algorithm (so I'm not sure how to simulate a larger network), but I think the same “algorithm” works pretty well on larger networks, can you test it?

Thanks a lot, hat's brilliant, with a dramatic cut in time (from almost 2 minutes to about 20 seconds over a 90K lines table)! The idea is that I now want to use this to generate convex hulls around edges for different thresholds of `result` column on a map. — Laurent, Nov 18 '21 at 08:20
Happy that it was useful! I just want to suggest comparing the two procedures on a few more (slightly larger) networks just to be sure that they always give the same results. — agila, Nov 18 '21 at 08:51

score 0 · Answer 2 · answered Nov 15 '21 at 10:32

Following Donald Seinen's tip, I used data.table to speed things a bit.

library(data.table)
paths_dt = data.table(paths)
edges_dt = data.table(edges)

# Getting the sum of soc for all edges
paths_dt$result <- to_vec(for (edge in 1:nrow(paths_dt)) 
# Getting the id of the last edge
edges_dt[unlist(paths_dt[edge, 'edge_paths']), soc] %>% sum())
paths_dt$id <- apply(paths_dt, 1, function(df) unlist(edges_dt[df$edge_paths %>% last(), 'id'], use.names=F))
# Applying the result to the corresponding edge
edges_dt <- left_join(edges_dt, paths_dt %>% unlist() %>% select(id, result), on=id)

However, even though that's faster than what I was doing before, it still takes a very long time (something like 10 minutes, and I'm only working on a fraction of the amount of data I should be using).

If anyone could propose another hint, I'm still looking for a better way.

R tibble with list of indexes: how to quickly use them?

2 Answers2