
I need to multiply values in a data frame (based on a specific grouping) by a separate matrix that puts some kind of weights on those values. The multiplication is part of a function that I wrote. I know how to do this in the most basic way. But I cannot understand how I can do it in a more realistic setting. I hope my example makes this problem clear.

I have the following example dataset:

library(dplyr)  # provides tibble(), arrange() and %>%

set.seed(45)
tibble(site = rep(c(LETTERS[1:3]), each = 6),
       name = rep(c(letters[10:15]), 3),
       size = runif(18)) %>%
  arrange(site, name) -> d_tibble

I also have a matrix that could represent some kind of weights:

d_matrix <- matrix(0, 6, 6)
diag(d_matrix) <- 1
rownames(d_matrix) <- letters[10:15]
colnames(d_matrix) <- letters[10:15]

d_matrix
##   j k l m n o
## j 1 0 0 0 0 0
## k 0 1 0 0 0 0
## l 0 0 1 0 0 0
## m 0 0 0 1 0 0
## n 0 0 0 0 1 0
## o 0 0 0 0 0 1

I also have a function that normalises the vector a to proportions p and then multiplies p by the matrix b:

test_fct <- function(a, b) {
  p <- a / sum(a)
  sum(p * (p %*% b))
}
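
To make explicit what test_fct() computes: `p %*% b` is a 1 x n row matrix, and multiplying it elementwise by p and summing gives the quadratic form p' B p. With an identity matrix this reduces to sum(p^2). A minimal check, using made-up values (`a_demo`, `b_demo` and the `res_*` names are illustrative, not part of the original data):

```r
test_fct <- function(a, b) {
  p <- a / sum(a)        # normalise to proportions
  sum(p * (p %*% b))     # quadratic form p' B p
}

a_demo <- c(1, 1, 2)     # hypothetical sizes
b_demo <- diag(3)        # identity weights, like d_matrix

p_demo   <- a_demo / sum(a_demo)
res_fct  <- test_fct(a_demo, b_demo)
res_quad <- drop(t(p_demo) %*% b_demo %*% p_demo)  # explicit quadratic form
res_sq   <- sum(p_demo^2)                          # what identity weights reduce to

all.equal(res_fct, res_quad)
all.equal(res_fct, res_sq)
```

Because the result depends on which element of p meets which row and column of b, the vector and the matrix have to refer to the same names in the same order.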

Then I want to do something like this, i.e. using my function in summarise():

#d_tibble %>%
#  group_by(site) %>%
#  summarise(y = test_fct(size, b))

But I don't know how to get b, i.e. the matrix, into my custom function so that its row and column names match the values of the name variable within each site group.

One way I tried was to merge the matrix onto the data frame - that way I have everything in one data frame:

d_tibble %>%
  left_join(d_matrix %>%
              as_tibble() %>%
              mutate(name = rownames(d_matrix)),  # each tibble row is one matrix row
            by = "name") -> tibble_matrix_join

Then I have it all together, but I still need to access the unique values of the name variable within each site group in order to select the correct columns (j, k, l, m, n, o) for the vector/matrix multiplication in my function test_fct():

#tibble_matrix_join %>%
#  group_by(site) %>%
#  summarise(result = test_fct(size, b))

I tried to check if the general set-up works, that is for only one site and including all names in the matrix, and it does:

d_tibble %>% 
    filter(site == "A") %>% 
    pull(size) -> my_x 

test_fct(my_x, d_matrix)
## [1] 0.1858158

my_p <- my_x/sum(my_x)
sum(my_p * (my_p %*% d_matrix))
## [1] 0.1858158
Stefan
  • @akrun it all works fine unless the number and values in the `name` variable within the group vary, e.g. `d_tibble %>% slice_sample(n = 12) %>% arrange(site, name) %>% group_by(site) %>% summarise(out = test_fct(size, d_matrix))`. So I somehow have to select the correct `name` values, given the dataset and grouping to set up the correct matrix columns, or I get the error (see example above). – Stefan Nov 10 '22 at 02:05

1 Answer

In the example, all the columns of d_matrix are found in the 'name' column of the tibble for every 'site'. If that is not the case, we may subset the matrix to the names present in each group:

library(dplyr)
d_tibble %>%
  group_by(site) %>%
  summarise(out = test_fct(size,
                           d_matrix[intersect(row.names(d_matrix), name),
                                    intersect(colnames(d_matrix), name),
                                    drop = FALSE]),
            .groups = "drop")

-output

# A tibble: 3 × 2
  site    out
  <chr> <dbl>
1 A     0.186
2 B     0.264
3 C     0.218
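
The `drop = FALSE` in the subsetting matters when a group contains a single name: by default `[` collapses a 1 x 1 result to a plain number and loses the dimensions. A small sketch (`m_demo` is a hypothetical stand-in for `d_matrix`, built the same way):

```r
m_demo <- diag(3)
dimnames(m_demo) <- list(c("j", "k", "l"), c("j", "k", "l"))

dropped <- m_demo["j", "j"]                # plain numeric, dims lost
kept    <- m_demo["j", "j", drop = FALSE]  # still a 1 x 1 matrix

is.matrix(dropped)  # FALSE
is.matrix(kept)     # TRUE
dim(kept)           # 1 1
```

In this example test_fct() gives the same answer either way, because `%*%` also accepts plain vectors, but keeping the matrix shape is the safer default.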

-testing on a smaller dataset

d_tibble %>%
  slice_sample(n = 12) %>%
  arrange(site, name) %>%
  group_by(site) %>%
  summarise(out = test_fct(size,
                           d_matrix[intersect(row.names(d_matrix), name),
                                    intersect(colnames(d_matrix), name),
                                    drop = FALSE]),
            .groups = "drop")

-output

# A tibble: 3 × 2
  site    out
  <chr> <dbl>
1 A     0.227
2 B     0.416
3 C     0.481
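
One caveat, sketched under assumptions: `intersect(x, y)` returns the common elements in the order of `x`, so the subset above follows the order of the matrix, not the order of `name` within the group. With a diagonal matrix this is harmless, but with off-diagonal weights the rows and columns have to line up with `size`. Indexing by `name` directly keeps the data's order. The matrix `w_demo` and the vectors below are made up for illustration:

```r
test_fct <- function(a, b) {
  p <- a / sum(a)
  sum(p * (p %*% b))
}

# Hypothetical weights with one off-diagonal entry (row k, column j)
w_demo <- matrix(c(1, 0, 0,
                   1, 1, 0,
                   0, 0, 1), nrow = 3, byrow = TRUE,
                 dimnames = list(c("j", "k", "l"), c("j", "k", "l")))

name_demo <- c("l", "j", "k")  # data order differs from matrix order
size_demo <- c(3, 1, 2)        # sizes of l, j, k

by_matrix_order <- intersect(rownames(w_demo), name_demo)  # "j" "k" "l"

out_intersect <- test_fct(size_demo,
                          w_demo[by_matrix_order, by_matrix_order, drop = FALSE])
out_by_name   <- test_fct(size_demo,
                          w_demo[name_demo, name_demo, drop = FALSE])

isTRUE(all.equal(out_intersect, out_by_name))  # FALSE: the two orderings disagree
```

Note that indexing with `name` assumes every name in the group exists in the matrix and is unique within the group; a name missing from the matrix raises a "subscript out of bounds" error.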
akrun
  • This looks very good and works on the smaller sample! I will check against my actual dataset and get back to you soon. What does `.groups = "drop"` do? When I leave it out, it works too (on the smaller dataset). Also, is `drop = FALSE` an argument of the square brackets `[]`? It also works with it removed. THX! – Stefan Nov 10 '22 at 02:25
  • @Stefan Regarding the edited comment: 1) `.groups` in `summarise` is described in more detail [here](https://stackoverflow.com/questions/62140483/how-to-interpret-dplyr-message-summarise-regrouping-output-by-x-override/62140681#62140681). 2) `drop = FALSE`: when a subset of a matrix or data frame returns only a single row or column, R drops the dimension attributes and returns a vector; `drop = FALSE` keeps the `dim` attribute, which may be essential for your `test_fct` function to work. – akrun Nov 10 '22 at 03:01
  • Akrun, I hope you can help me with one more related question: how can I make sure that `d_matrix` and `d_tibble$name` are ordered in the same way before doing the vector/matrix multiplication? In my actual, bigger dataset I get different results depending on whether or not I sort the matrix alphabetically first, i.e. `d_matrix` vs `d_matrix[sort(rownames(d_matrix)), sort(colnames(d_matrix))]`. THX! – Stefan Nov 20 '22 at 00:08
  • Actually, I can replace `d_matrix[intersect(row.names(d_matrix), name), intersect(colnames(d_matrix), name), drop = FALSE]`, which is the second argument of `test_fct()`, with `d_matrix[name, name]`. This solved the ordering problem I described in my previous comment. – Stefan Nov 20 '22 at 02:38