Matching and indexing through two dataframes and one matrix

Question

I have a dataframe events with xy-coords of unique points.
I have a dataframe all_nodes with xy-coords of network nodes. All points of events are also in all_nodes, but not necessarily only once, and at different positions, i.e., the index (row id) of a point in events does not correspond to all_nodes.
I have a matrix ma of dimension nrow(all_nodes) times nrow(all_nodes) with calculated pairwise interaction terms between all nodes. The rows and columns of ma correspond to the index (row ids) of all_nodes.

My overall goal is to identify the row ids of events in all_nodes. With these I want to extract a submatrix of pairwise interactions from my matrix ma according to the detected row ids. Finally, I want to reorder the submatrix such that its ids and corresponding points follow the order of events. Any kind of help (code/reference/hint) is much appreciated!

Toy data (you can find real data below)

# coords of unique events 
events <- data.frame(x = c(1,2,3,4),
                     y = c(4,3,2,1))
# all_nodes 
all_nodes <- data.frame(x = c(2,1,120,3,150,4,1),
                     y = c(3,4,120,2,150,1,4))
# matrix corresponding to the index of all_nodes
ma <- matrix(data = rnorm(n = 49, mean = 3, sd = 1), 
             nrow = nrow(all_nodes), ncol = nrow(all_nodes))
ma[6, ] <- ma[2, ]

My attempt so far, which does not get me there because I ran into several problems:

# coords of unique events 
events # see toy data

# ------------------------------------------------
# from object g of class  "sfnetwork" "tbl_graph" "igraph" 
# all rounded coords of nodes; from g ma is used 
# in several steps 
# cols and rows in ma correspond to node ids of g/all_nodes

# all_nodes <- g %>% tidygraph::activate("nodes") %>%
# as.data.frame(geometry)
# all_nodes <- as.data.frame(matrix(unlist(all_nodes$geometry), ncol = 2, byrow = TRUE))
# names(all_nodes) <- c('x', 'y')
# all_nodes <- round(all_nodes, 2)
# --------------------------------------------------

# matching based on x-coord only 
ix <- which(all_nodes$x %in% events$x)
# Problem A
length(ix) == nrow(events) # different length
# Problem B
# and the event with coords x=1, y=4 occurs twice in ix 

sub <- ma[ix, ix]
# If problems A+B were eliminated, sub would correspond to
# all events, but the different indexing makes it unusable (several permutations possible)

I also played around with st_equals {sf} to compare geometries directly, using events <- sf::st_as_sf(events[, c('x', 'y')], coords = c('x', 'y')) in a previous step.

Real data

# removed

Have a look at [match two data.frames based on multiple columns](https://stackoverflow.com/q/26596305/10488504). — GKi, Apr 20 '22 at 06:40

score 3 · Answer 1 · answered Apr 20 '22 at 06:36


interaction could be used to match on multiple columns.

idx <- match(interaction(events), interaction(all_nodes))
ma[idx,idx]
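
As a rough check on the toy data from the question (a sketch, not part of the original answer): `match()` returns, for each row of `events`, the position of its first occurrence in `all_nodes`, so `idx` has one entry per event, in the order of `events`, even though the point (1, 4) appears twice in `all_nodes`. The `set.seed()` call and the `dimnames` labels are additions for illustration only.

```r
# Toy data from the question
events <- data.frame(x = c(1, 2, 3, 4),
                     y = c(4, 3, 2, 1))
all_nodes <- data.frame(x = c(2, 1, 120, 3, 150, 4, 1),
                        y = c(3, 4, 120, 2, 150, 1, 4))

set.seed(1)  # assumption: fixed seed so the run is reproducible
ma <- matrix(rnorm(nrow(all_nodes)^2, mean = 3, sd = 1),
             nrow = nrow(all_nodes), ncol = nrow(all_nodes))
# Label rows and columns with the node id (illustration only)
dimnames(ma) <- list(seq_len(nrow(all_nodes)), seq_len(nrow(all_nodes)))

# One index per event: first matching row of all_nodes
idx <- match(interaction(events), interaction(all_nodes))
idx

# Submatrix ordered like events; its row names show which nodes were picked
sub <- ma[idx, idx]
rownames(sub)

# The matched nodes carry the same coordinates as events, row by row
all(all_nodes[idx, "x"] == events$x & all_nodes[idx, "y"] == events$y)
```

One caveat worth keeping in mind: `interaction()` joins the column values with `"."` by default, so coordinates with decimals could in principle produce ambiguous keys; a different `sep` or one of the `paste`-based variants avoids that.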

GKi


ThomasIsCoding · Accepted Answer · 2022-04-23T21:31:40.240


Probably we should do the match task like below

idx <- match(do.call(paste, events), do.call(paste, all_nodes))
ma[idx,idx]

or

idx <- match(asplit(events, 1), asplit(all_nodes, 1))
ma[idx, idx]
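
When trying this on real coordinates, it may help to wrap the matching in a small function that fails loudly if an event has no counterpart in `all_nodes`. The sketch below is an illustration under assumptions, not the answerer's code: `match_rows` and its `digits` argument are hypothetical names, and rounding both data frames to the same precision is an assumption based on the `round(all_nodes, 2)` step in the question.

```r
# Hypothetical helper: match rows of `a` to rows of `b` on all shared
# columns after rounding, and stop if any row of `a` has no match.
match_rows <- function(a, b, digits = 2) {
  cols  <- intersect(names(a), names(b))
  key_a <- do.call(paste, round(a[cols], digits))
  key_b <- do.call(paste, round(b[cols], digits))
  idx   <- match(key_a, key_b)  # first occurrence in b for every row of a
  if (anyNA(idx)) {
    stop("rows of `a` without a match in `b`: ",
         paste(which(is.na(idx)), collapse = ", "))
  }
  idx
}

# Toy data from the question
events <- data.frame(x = c(1, 2, 3, 4), y = c(4, 3, 2, 1))
all_nodes <- data.frame(x = c(2, 1, 120, 3, 150, 4, 1),
                        y = c(3, 4, 120, 2, 150, 1, 4))

idx <- match_rows(events, all_nodes)
idx

# Coordinates of the matched nodes, row by row in the order of events
all_nodes[idx, ]
```

Comparing `all_nodes[idx, ]` with `events` afterwards is a quick way to confirm that the submatrix `ma[idx, idx]` follows the order of `events`.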

Benchmark

TIC1 <- function() {
    match(do.call(paste, events), do.call(paste, all_nodes))
}

TIC2 <- function() {
    match(asplit(events, 1), asplit(all_nodes, 1))
}


GKi <- function() {
    match(interaction(events),interaction(all_nodes))
}

library(bench)
library(ggplot2) # provides the autoplot() generic used below
bm <- mark(
    TIC1(),
    TIC2(),
    GKi()
)
autoplot(bm)

gives

> bm
# A tibble: 3 x 13
  expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 TIC1()     175.9us 197.5us     4573.        0B     24.5  2052    11      449ms
2 TIC2()      30.2us  32.1us    28884.        0B     14.4  9995     5      346ms
3 GKi()      311.2us 349.1us     2741.    1.53KB     27.1  1212    12      442ms
# ... with 4 more variables: result <list>, memory <list>, time <list>,
#   gc <list>

edited Apr 23 '22 at 21:31

answered Apr 19 '22 at 08:33

ThomasIsCoding


Thank you for that brilliant piece of code. Using this with the toy data, I run into the problem that `length(idx)` is not equal to `nrow(events)`, since the point (1,4) occurs twice in `all_nodes`. As a result, the ids of `ma[idx, idx]` do not match the points of `events` (shifted index). My goal is a submatrix where the order of points corresponds to the order of `events` and each point of `events` occurs only once (in the correct/initial order). – Pax Apr 19 '22 at 09:11

@Pax Thanks for the feedback. If you have conflicts, e.g., having point (1,4) twice, what shall we do then? Shall we choose one only, randomly? – ThomasIsCoding Apr 19 '22 at 09:45
Only one is needed, such that the submatrix corresponds to the points of `events` in the order of the points in `events`. Which one is chosen does not matter, since the calculated interpoint interactions are identical. – Pax Apr 19 '22 at 10:00
To make this clearer, I extended the toy data with the line `ma[6, ] <- ma[2, ]`. – Pax Apr 19 '22 at 10:12

@Pax Perhaps we need `unique` to remove duplicates, please see my update – ThomasIsCoding Apr 19 '22 at 10:33
Unfortunately, with my real data, `head(events)` is different from `head(all_nodes[idx, ])`. I attached the real nodes and real events to my Q. Sorry for that! – Pax Apr 19 '22 at 12:33
@Pax You can see if my updated answers works for you – ThomasIsCoding Apr 19 '22 at 12:52
Unfortunately, `head(events)` and `head(all_nodes[idx, ])` remain different. Therefore, I conclude that I cannot work with the submatrix based on `[idx, idx]`, as the ordering (index of points) is different from `events`. – Pax Apr 19 '22 at 12:58
@Pax Sorry, I forgot to remove `unique` in my updates. Please try it again. – ThomasIsCoding Apr 19 '22 at 13:03
