How to extract rows with similar names into a submatrix?

Question

I am building an asymmetrical matrix of values, with the rows named after coefficients and a single column holding the value of each coefficient:

# Set up row and column names.
rows = c("Intercept", "actsBreaks0", "actsBreaks1","actsBreaks2","actsBreaks3","actsBreaks4","actsBreaks5","actsBreaks6",
            "actsBreaks7","actsBreaks8","actsBreaks9","tBreaks0","tBreaks1","tBreaks2","tBreaks3", "unitBreaks0", "unitBreaks1",
            "unitBreaks2","unitBreaks3", "covgBreaks0","covgBreaks1","covgBreaks2","covgBreaks3","covgBreaks4","covgBreaks5",
            "covgBreaks6","yearBreaks2016","yearBreaks2015","yearBreaks2014","yearBreaks2013","yearBreaks2011",
            "yearBreaks2010","yearBreaks2009","yearBreaks2008","yearBreaks2007","yearBreaks2006","yearBreaks2005",
            "yearBreaks2004","yearBreaks2003","yearBreaks2002","yearBreaks2001","yearBreaks2000","yearBreaks1999",
            "yearBreaks1998","plugBump0","plugBump1","plugBump2","plugBump3")
cols = c("Value")

# Build the matrix.
matrix1 <- matrix(c(1:48), nrow = 48, ncol = 1, byrow = TRUE, dimnames = list(rows,cols))

output:

> matrix1
               Value
Intercept          1
actsBreaks0        2
actsBreaks1        3
actsBreaks2        4
actsBreaks3        5
actsBreaks4        6
actsBreaks5        7
actsBreaks6        8
actsBreaks7        9
actsBreaks8       10
actsBreaks9       11
tBreaks0          12
tBreaks1          13
tBreaks2          14
tBreaks3          15
unitBreaks0       16
unitBreaks1       17
unitBreaks2       18
unitBreaks3       19
covgBreaks0       20
covgBreaks1       21
covgBreaks2       22
covgBreaks3       23
covgBreaks4       24
covgBreaks5       25
covgBreaks6       26
yearBreaks2016    27
yearBreaks2015    28
yearBreaks2014    29
yearBreaks2013    30
yearBreaks2011    31
yearBreaks2010    32
yearBreaks2009    33
yearBreaks2008    34
yearBreaks2007    35
yearBreaks2006    36
yearBreaks2005    37
yearBreaks2004    38
yearBreaks2003    39
yearBreaks2002    40
yearBreaks2001    41
yearBreaks2000    42
yearBreaks1999    43
yearBreaks1998    44
plugBump0         45
plugBump1         46
plugBump2         47
plugBump3         48

and I wish to extract certain rows that share row names (i.e. all rows with "unitBreaks'x'") into a submatrix.

I tried this

est_actsBreaks <- matrix1[c("actsBreaks0","actsBreaks1","actsBreaks2","actsBreaks3",
                            "actsBreaks4","actsBreaks5","actsBreaks6","actsBreaks7",
                            "actsBreaks8","actsBreaks9"),c("Value")]

but it returns a vector and I need a matrix. I have seen other questions about similar procedures, but their columns and rows all had identical names and/or values. Is there a way to do the operation I have in mind, for example with grep()?

Vincent Guillemot · Accepted Answer · 2021-08-18T09:58:13.573

Welcome to StackOverflow.

As usual in R, there would probably be many ways to do what you request.

EDIT: I realized that my solution was going a little bit too far, sorry about that.

To extract only the rows whose names consist of "unitBreaks" followed by one or more digits, and still keep a matrix structure, you can run the following code. In a nutshell, grep looks for the pattern you need and returns the positions of the matching row names, and the argument drop = FALSE makes sure that you get a matrix as a result and not a vector.

# "+" requires at least one digit; "^" and "$" anchor the whole row name
unitBreakLines <- grep("^unitBreaks[0-9]+$", rows)
matrix1[unitBreakLines, , drop = FALSE]
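To see why drop = FALSE matters, here is a minimal sketch on a smaller matrix. The object m and its row names are made up for illustration and are not taken from the question. It uses grepl, which returns a logical vector rather than positions, and compares default indexing with drop = FALSE.

```r
# Hypothetical 5 x 1 matrix, a cut-down version of the question's layout.
m <- matrix(1:5, ncol = 1,
            dimnames = list(c("Intercept", "unitBreaks0", "unitBreaks1",
                              "tBreaks0", "tBreaks1"),
                            "Value"))

# TRUE for row names made of "unitBreaks" followed by at least one digit.
keep <- grepl("^unitBreaks[0-9]+$", rownames(m))

as_vector <- m[keep, ]                # default drop = TRUE: dimensions are dropped
as_matrix <- m[keep, , drop = FALSE]  # stays a 2 x 1 matrix

is.matrix(as_vector)  # FALSE
dim(as_matrix)        # 2 1
```

Either a logical vector from grepl or the positions from grep can be used as the row index; the result is the same.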

Below is the first version of my answer.

First, I create a vector that describes the groups of rows. For this, I remove the numbers at the end of the row names.

grp <- gsub("[0-9]+$", "", rows)

Then, I transform the data matrix into a data-frame (why I do that is explained a little bit later).

dat1 <- data.frame(matrix1)

Finally, I use "split" on the data-frame, with the groups defined earlier. Using split on the data-frame will keep the structure: the result will be a list of data-frames, even though there is only one column.

dat1.split <- split(dat1, grp)

The result is a list of data-frames.

lapply(dat1.split, head)

$actsBreaks
            Value
actsBreaks0     2
actsBreaks1     3
actsBreaks2     4
actsBreaks3     5
actsBreaks4     6
actsBreaks5     7

$covgBreaks
            Value
covgBreaks0    20
covgBreaks1    21
covgBreaks2    22
covgBreaks3    23
covgBreaks4    24
covgBreaks5    25

$Intercept
          Value
Intercept     1

$plugBump
          Value
plugBump0    45
plugBump1    46
plugBump2    47
plugBump3    48

$tBreaks
         Value
tBreaks0    12
tBreaks1    13
tBreaks2    14
tBreaks3    15

$unitBreaks
            Value
unitBreaks0    16
unitBreaks1    17
unitBreaks2    18
unitBreaks3    19

$yearBreaks
               Value
yearBreaks2016    27
yearBreaks2015    28
yearBreaks2014    29
yearBreaks2013    30
yearBreaks2011    31
yearBreaks2010    32

After that, if you still need matrices, you can convert them with the function as.matrix in an "lapply":

matrix1.split <- lapply(dat1.split, as.matrix)
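If matrices are the end goal anyway, the detour through a data-frame can probably be skipped. The sketch below is an alternative, not the approach described above: it splits the row positions by group and subsets the matrix with drop = FALSE. The small matrix m is a made-up stand-in for matrix1.

```r
# Hypothetical stand-in for matrix1, with three groups of row names.
m <- matrix(1:6, ncol = 1,
            dimnames = list(c("Intercept", "tBreaks0", "tBreaks1",
                              "unitBreaks0", "unitBreaks1", "unitBreaks2"),
                            "Value"))

# Group label for each row: the row name without its trailing digits.
grp <- sub("[0-9]+$", "", rownames(m))

# Split the row positions by group, then subset the matrix for each group.
idx <- split(seq_len(nrow(m)), grp)
m.split <- lapply(idx, function(i) m[i, , drop = FALSE])

m.split$unitBreaks            # 3 x 1 matrix with rows unitBreaks0 to unitBreaks2
is.matrix(m.split$Intercept)  # TRUE, even for a single row
```

Because every element is produced with drop = FALSE, single-row groups such as Intercept stay matrices too.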

You might want to consider combining your data in a "tibble" with the "grouping" column. You will then be able to use these groups with the group_by function or other functions from the dplyr package (or other packages from the tidyverse).

For example:

library(dplyr)
# simpler_rows is the grouping vector grp built earlier
tib1 <- tibble(rows, simpler_rows = grp, value = 1:48)

And an example on how to use the grouping variable:

tib1 %>%
  group_by(simpler_rows) %>%
  summarize(sum(value))

EDIT bis: what if I don't know the pattern?

I played around a little bit with your example to answer the question (that nobody asked, but still, it's fun!): "what if I don't know the pattern?"

In this case, I would use a string distance between the row names. The distance matrix, drawn as a heatmap, is the output of the following lines of code:

library(stringdist)
library(pheatmap)

strdist <- stringdistmatrix(rows)
pheatmap(strdist, border_color = "white",
         cluster_rows = FALSE, cluster_cols = FALSE,
         cellwidth = 10, cellheight = 10,
         labels_row = rows, fontsize_row = 7)

After that, I only need to get the number of clusters, which can be done with a silhouette plot. It tells me that there are 8 clusters of words, which seems about right.

The clusters can then be extracted with the functions used to create the silhouette plot (I used hclust and cutree).
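The clustering step is not shown in the answer, so here is a rough sketch of what it might look like. Several things are assumptions: the shortened vector nms is made up, base R's adist (Levenshtein distance) stands in for stringdist, the "average" linkage is an arbitrary choice, and k is simply set by hand instead of being read off a silhouette plot.

```r
# Hypothetical shortened set of row names, for illustration only.
nms <- c("Intercept", "actsBreaks0", "actsBreaks1", "actsBreaks2",
         "tBreaks0", "tBreaks1", "unitBreaks0", "unitBreaks1",
         "yearBreaks2016", "yearBreaks2015", "plugBump0", "plugBump1")

# Levenshtein distances from base R (utils::adist), standing in for stringdist.
d <- as.dist(adist(nms))

# Hierarchical clustering; the linkage method is an arbitrary choice here.
hc <- hclust(d, method = "average")

# Cut the tree into k groups; in practice k would come from the silhouette plot.
k <- 6
clusters <- cutree(hc, k = k)

# Row names grouped by cluster label.
split(nms, clusters)
```

Whether the clusters line up with the intended coefficient families depends on the distance and linkage chosen, so the grouping is worth checking by eye before using it.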

I should then be able to convert those data-frames back into matrices, correct? – PDiddyA, Aug 17 '21 at 17:31
Sorry, I read your question too fast. I updated my answer with two lines of code that answer your question exactly. If I may ask: why do you absolutely need a matrix at the end of the operation? – Vincent Guillemot, Aug 18 '21 at 09:03
I'm taking the coefficients of a glm with 7 variable "families" and applying them to a dataset to test the efficacy of a model, and I figured the easiest way to do so would be to multiply the matrix of each beta by each "x" matrix. – PDiddyA, Aug 18 '21 at 16:16
I see. In that case, a vector will work just as well as a column matrix. – Vincent Guillemot, Aug 20 '21 at 08:09

score 0 · Answer 2 · answered Aug 17 '21 at 17:51


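Here is a solution with dplyr and stringr that filters rows by whether their row names contain a certain string. The data frame df has to be created from the matrix first. As written, the ! negates the match, so the rows containing "unitBreaks" are dropped; remove the ! to keep only those rows instead. At the end, change back to a matrix:

library(dplyr)
library(stringr)

# df is assumed to be the question's matrix converted to a data frame
df <- as.data.frame(matrix1)

# "!" inverts the test: rows whose names contain "unitBreaks" are removed
df1 <- df %>% 
    filter(!str_detect(rownames(df), "unitBreaks")) 

df1 <- as.matrix(df1)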

               Value
Intercept          1
actsBreaks0        2
actsBreaks1        3
actsBreaks2        4
actsBreaks3        5
actsBreaks4        6
actsBreaks5        7
actsBreaks6        8
actsBreaks7        9
actsBreaks8       10
actsBreaks9       11
tBreaks0          12
tBreaks1          13
tBreaks2          14
tBreaks3          15
covgBreaks0       20
covgBreaks1       21
covgBreaks2       22
covgBreaks3       23
covgBreaks4       24
covgBreaks5       25
covgBreaks6       26
yearBreaks2016    27
yearBreaks2015    28
yearBreaks2014    29
yearBreaks2013    30
yearBreaks2011    31
yearBreaks2010    32
yearBreaks2009    33
yearBreaks2008    34
yearBreaks2007    35
yearBreaks2006    36
yearBreaks2005    37
yearBreaks2004    38
yearBreaks2003    39
yearBreaks2002    40
yearBreaks2001    41
yearBreaks2000    42
yearBreaks1999    43
yearBreaks1998    44
plugBump0         45
plugBump1         46
plugBump2         47
plugBump3         48


TarJae


is there a step I am missing in your solution somewhere? When I attempt to execute `df1 <- df %>% filter(!str_detect(rownames(df), "unitBreaks"))` I get an error: `no applicable method for 'filter' applied to an object of class "function"` – PDiddyA Aug 17 '21 at 18:03
have you loaded the packages `library(dplyr) library(stringr)` – TarJae Aug 17 '21 at 18:07
I did: `library(dplyr) Attaching package: ‘dplyr’ The following object is masked from ‘package:MASS’: select The following object is masked from ‘package:car’: recode The following objects are masked from ‘package:stats’: filter, lag The following objects are masked from ‘package:base’: intersect, setdiff, setequal, union > library(stringr) > df1 <- df %>% + filter(!str_detect(rownames(df), "unitBreaks")) Error in UseMethod("filter") : no applicable method for 'filter' applied to an object of class "function"` – PDiddyA Aug 18 '21 at 14:04

The libraries loaded fine it seems. – PDiddyA Aug 18 '21 at 14:04
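The error reported in these comments is consistent with df never having been created: in a fresh R session the name df resolves to the function stats::df (the density of the F distribution), which would explain why filter complains about an object of class "function". Below is a minimal base-R sketch of the missing step, assuming df was meant to be the matrix converted to a data frame. The small matrix m is made up for illustration.

```r
# With no user-defined df, the name refers to a function from the stats package.
is.function(stats::df)  # TRUE

# Hypothetical stand-in for the question's matrix.
m <- matrix(1:4, ncol = 1,
            dimnames = list(c("Intercept", "unitBreaks0", "unitBreaks1", "tBreaks0"),
                            "Value"))

# The missing step: build the data frame before filtering it.
df <- as.data.frame(m)

# Keep only the unitBreaks rows, then convert back to a matrix.
unit_rows <- df[grepl("^unitBreaks", rownames(df)), , drop = FALSE]
unit_mat <- as.matrix(unit_rows)
```

Naming the object something other than df (for example coef_df) would avoid the clash with the stats function altogether.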
