Problem with R performance when I assign new attribute to two different graphs

Question

In my R script, i have a graph object "flights" and then I assign to the edges an attribute "type" with this code:

stats <- summary(E(graph)$weight)

# 1st threshfirstThresh <- as.double(stats["1st Qu."]) 
firstThresh

# 2nd thresh 
secondThresh <- as.double(stats["3rd Qu."])

for (i in 1:length(E(flights))){
  if(E(graph)[i]$weight < firstThresh)
    E(graph)[i]$type <- "C"
  else if (E(graph)[i]$weight < secondThresh)
    E(graph)[i]$type <- "M"
  else
    E(graph)[i]$type <- "L"
  cat(i , " - ")
}

Why with this code a single iteration of the "for" loop is really much slower in if I use another graph with an higher number of nodes and edges?

In particular, i made a simple benchmark in this way:

start.time <- Sys.time()
...Relevent codes...
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken

These are the results for 200 loop iterations on the two graphs:

for the first graph: 0.5541661 secs
for the first graph: 26.57538 secs

Why there's so much difference even if the code is the same?

It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input that can be used to test and verify possible solutions. — MrFlick, Mar 15 '23 at 19:38
@MrFlick i have the graph stored in a file, and they have 6000 and 1500 nodes. How can i post a reproducible example? — xSambx02, Mar 15 '23 at 19:40
Demonstrate the issue using a small sample graph you can provide code for, and maybe a larger graph with more nodes that can be simulated easily. We need to be able to run your code on **something** in order to try to improve it. — Gregor Thomas, Mar 15 '23 at 19:46
i don't understand ... in the code i don't use any attribute of the graph, i only assign a new attribute , how a graph example can be usefull? i don't even know how to provide the example of the graph because it's an entire graphml file that i cannot attach here — xSambx02, Mar 15 '23 at 19:54
Could you show us the result of: `str(graph)`? Is it possible to reproduce the result with a random graph? — clp, Mar 15 '23 at 20:07
Result of str: - First graph : Class 'igraph' hidden list of 10 $ : num 235 $ : logi FALSE $ : num [1:2101] 136 ... $ : num [1:2101] 0 ... $ : num [1:2101] 58 ... $ : num [1:2101] 267... $ : num [1:236] 0 ... $ : num [1:236] 0 ... $ :List of 4 ..$ : num [1:3] 1 0 1 ..$ : Named list() ..$ :List of 4 .. ..$ x : num [1:235] -9 .. .. ..$ tooltip: chr [1:235] "LIT(lngx=-5,laty=3)" ... .. ..$ y : num [1:235] -347 ... .. ..$ id : chr [1:235] "0" ... ..$ :List of 1 .. ..$ id: chr [1:2101] "0" ... $ : — xSambx02, Mar 15 '23 at 20:12
second graph (part 1): Class 'igraph' hidden list of 10 $ : num 6072 $ : logi TRUE $ : num [1:66923] 2262... $ : num [1:66923] 2284... $ : num [1:66923] 17042... $ : num [1:66923] 17068... $ : num [1:6073] 0... $ : num [1:6073] 0... $ :List of 4 ..$ : num [1:3] 1... ..$ : Named list() ..$ :List of 13 .. ..$ name : chr [1:6072] "Goroka Airport"... .. ..$ city : chr [1:6072] "Goroka"... .. ..$ country : chr [1:6072] "Papua"... .. ..$ ICAO : chr [1:6072] "AYGA"... — xSambx02, Mar 15 '23 at 20:16
second graph (part 2) .. ..$ latitude : num [1:6072] -6... .. ..$ longitude : num [1:6072] 145... .. ..$ altitude : num [1:6072] 5... .. ..$ id : chr [1:6072] "n0"... ..$ :List of 8 .. ..$ airline : chr [1:66923] "2B"... .. ..$ airlineID : chr [1:66923] "410"... .. ..$ sourceAirportID : chr [1:66923] ... .. ..$ destinationAirportID: chr [1:66923]... .. ..$ weight: num [1:66923] 1509 1042 449 771 1340 ... $ : — xSambx02, Mar 15 '23 at 20:16
_Why with this code a single iteration of the "for" loop is really much slower in if I use another graph with an higher number of nodes?_ You are iterating through edges, not nodes. Does the graph where the operation is faster have more edges as well? Or only more nodes? — Szabolcs, Mar 15 '23 at 20:43
@Szabolcs it also has more edges. But I think that it doesn't matter, I executed the benchmark with only 200 edges in both graphs — xSambx02, Mar 15 '23 at 20:50
If this is so, it is very unusual. As others said, please produce a minimal example. — Szabolcs, Mar 15 '23 at 21:02
Minimal example (1) `n <- 235; graph1 <- sample_gnm(n, 2121)`. Takes 0.17 seconds. (2) `n <- 6072; graph2 <- sample_gnm(6072, 66923)`. Takes 46 seconds. This is Windows11. As mentioned in the `igraph` forum,` igraph` is slow for single updates. — clp, Mar 15 '23 at 21:19
Set weight to: `E(graph1)$weight <- round(runif(ecount(graph1)) * 100)`. — clp, Mar 15 '23 at 21:28

Szabolcs · Answer 1 · 2023-03-15T20:41:52.737

0

Please try the following approach, and let us know if it is faster. I did not benchmark this, since I do not have your graph.

> g <- make_(ring(10), with_edge_(weight=1:10))

> E(g)$weight
 [1]  1  2  3  4  5  6  7  8  9 10

> E(g)$kind <- 'C'
> E(g)[weight < 3]$kind <- 'A'
> E(g)[3 <= weight & weight < 5]$kind <- 'B'

> E(g)$kind
 [1] "A" "A" "B" "B" "C" "C" "C" "C" "C" "C"

Be cautious with type as an attribute name. When applied to vertices, it should be of type logical, as it has a special interpretation as encoding the two partitions of a bipartite graph.

edited Mar 15 '23 at 20:41

answered Mar 15 '23 at 20:26

Szabolcs

24,728
9
85
174

So you think that the problem is the attribute name? – xSambx02 Mar 15 '23 at 20:32
@xSambx02 No, the attribute name will not cause performance issues. I clarified the answer. I'll leave the benchmarking to you, but it'll be nice if you can report back with your results. – Szabolcs Mar 15 '23 at 20:42

clp · Accepted Answer · 2023-03-21T14:47:12.070

Regarding igraph efficiency see: https://igraph.discourse.group/t/igraph-is-much-slower-than-networkx-when-generating-a-graph/853.

Much faster is to vectorize the coding:

set.seed(2023)
library(igraph)

firstThresh  <- 20
secondThresh <- 80

system.time({
n <- 6072; m <- 66923
graph           <- sample_gnm(n, m)
E(graph)$weight <- round(runif(m) * 100)

E(graph)$type <- "L"
E(graph)$type[which(E(graph)$weight < secondThresh)] <- "M"
E(graph)$type[which(E(graph)$weight < firstThresh)]  <- "C"
})
# Benchmark system.time.
# user  system elapsed 
# 0.02    0.00    0.02

print.igraph(graph, full="auto.print.lines=0")
# IGRAPH d66a81f U-W- 6072 66923 -- Erdos-Renyi (gnm) graph
# + attr: name (g/c), type (g/c), loops (g/l), m (g/n), weight (e/n), typ (e/c)

Surprise

library(microbenchmark)
microbenchmark(
"one" = (E(graph)[weight < secondTresh]$type <- "M"),
"two" = (E(graph)$type[which(E(graph)$weight < secondTresh)] <- "M")
)
# Unit: microseconds
# expr    min      lq     mean  median      uq     max neval
#  one 2648.5 2776.75 3607.604 2851.95 4149.95 19895.9   100
#  two  847.3  978.10 1364.860 1019.30 1317.25  3171.7   100

I have no explanation for the performance difference. However: In base R, why is selecting column, then filtering rows faster than vice versa: filter rows, then select column?.

Update - Why is vectorized code faster?
See also: Why is vectorization faster.

My casual answer:
Consider x <- y. It is much clearer if the coding is the same regardless of whether x consists of one or more values. In my experience, "vectorized" programming is more difficult to implement but less prone to errors. It is either right or completely wrong. Avoiding for-loops is also faster. The reason is that the time to execute an assignment statement is mainly determined by overhead, as we can see in the following example.

a <- 0; b <- 0
n <- 1E4; va <- runif(n); vb <- runif(n)

library(microbenchmark)
microbenchmark(
"single"    = (a <- b),
"multi"     = (va <- vb )
)

## Unit: nanoseconds
##    expr min  lq mean median  uq  max neval
##  single   0 100   87    100 100  500   100
##   multi   0 100  112    100 100 1400   100
object.size(va)
## 80048 bytes

But more importantly in this example

Within the loop, the graph is copied twice in each iteration, a total of 2 times the number of flights.

tracemem(graph)
E(graph)[i]$type <- "M"
untracemem(graph)
# tracemem[0x00000241f1305358 -> 0x00000241f1329f78]: 
# tracemem[0x00000241f1329f78 -> 0x00000241f1329ec8]: i_set_edge_attr E<-

Note @Szabolcs' solution is more elegant and concise. Therefore use `E(graph)[weight < secondTresh]$type <- "M"` instead of `E(graph)$type[which(E(graph)$weight < secondThresh)] <- "M"`. — clp, Mar 17 '23 at 12:04

Problem with R performance when I assign new attribute to two different graphs

2 Answers2