Extract rows from data frame which have matches from vector, but matches must be all the way at the end of string in value

Question

I have a data frame like the following:

sampleid <- c("patient_sdlkfjd_2354_CSF_CD19+", "control_sdlkfjd_2632_CSF_CD8+", "control_sdlkfjd_2632_CSF")
values <- rnorm(3, 8, 3)

df <- data.frame(sampleid, values)

I also have a vector like the following:

matches <- c("632_CSF_CD8+", "632_CSF")

I want to extract the rows of this data frame whose sampleid value ends with one of the matches. This example shows why the end of the string is important: two samples contain "632_CSF", but they are distinct samples. If I changed matches to only:

matches <- c("632_CSF")

then I would want only the third row of the data frame to be returned, because it is the only one where the match occurs at the end of the sampleid.

How can this be achieved?

Thanks!

To tell regex you want your pattern to be at the end of a string use `$`: `632_CSF$` — Justinas Marozas, Feb 03 '18 at 20:02

score 2 · Answer 1 · answered Feb 03 '18 at 20:03

Just use $ in your pattern to indicate that it occurs at the end of the string.

grep("632_CSF$", sampleid, value=TRUE)
# [1] "control_sdlkfjd_2632_CSF"
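Since the question asks for rows of the data frame rather than the matching strings, the same anchored pattern can be used with grepl to build a logical index. This is a minimal sketch, assuming the df from the question; the values column is filled with fixed, made-up numbers here only so the result is reproducible.

```r
# Sample data from the question; the numeric values are illustrative, not from the post
df <- data.frame(
  sampleid = c("patient_sdlkfjd_2354_CSF_CD19+",
               "control_sdlkfjd_2632_CSF_CD8+",
               "control_sdlkfjd_2632_CSF"),
  values = c(8.1, 10.7, 7.0),
  stringsAsFactors = FALSE
)

# grepl() returns one TRUE/FALSE per element; `$` anchors the pattern at the end
keep <- grepl("632_CSF$", df$sampleid)

# Logical indexing keeps whole rows, not just the matching strings
result <- df[keep, ]
print(result)
```

Only the third row is kept, because the second row contains "632_CSF" but does not end with it.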

G5W


score 1 · Answer 2 · answered Feb 03 '18 at 21:06

You can do this with stringr and a little string manipulation.

First, escape any regex metacharacters in each match (for example, the + in "632_CSF_CD8+"); the quotemeta function below does this.

Next, append $ to each escaped match so that it must occur at the end of the string, then join all matches into a single pattern with the regex OR operator, |.

Finally, pass the joined pattern to str_detect to get a logical index vector.

library(stringr)

# taken from here
# https://stackoverflow.com/a/14838753/1030110
quotemeta <- function(string) {
  str_replace_all(string, "(\\W)", "\\\\\\1")
}

matches_with_end <- sapply(matches, function(x) { paste0(quotemeta(x), '$') })
joined_matches <- paste(matches_with_end, collapse = '|')

ind <- str_detect(df$sampleid, joined_matches)
# [1] FALSE  TRUE  TRUE

df[ind, ]
#                        sampleid    values
# 2 control_sdlkfjd_2632_CSF_CD8+ 10.712634
# 3      control_sdlkfjd_2632_CSF  7.001628
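If the matches are plain suffixes rather than patterns, the escaping step can be skipped by comparing literal strings instead of using a regex. The sketch below swaps in base R's endsWith (available since R 3.3.0) in place of str_detect; ends_with_any is a hypothetical helper name, and the values column again holds made-up numbers.

```r
# Sample data from the question; the numeric values are illustrative, not from the post
df <- data.frame(
  sampleid = c("patient_sdlkfjd_2354_CSF_CD19+",
               "control_sdlkfjd_2632_CSF_CD8+",
               "control_sdlkfjd_2632_CSF"),
  values = c(8.1, 10.7, 7.0),
  stringsAsFactors = FALSE
)
matches <- c("632_CSF_CD8+", "632_CSF")

# Hypothetical helper: TRUE where x ends with at least one of the suffixes.
# endsWith() compares literal text, so "+" needs no escaping.
ends_with_any <- function(x, suffixes) {
  x <- as.character(x)  # endsWith() needs character input (guards against factors)
  hits <- lapply(suffixes, function(s) endsWith(x, s))
  # Combine the per-suffix logical vectors with element-wise OR
  Reduce("|", hits, rep(FALSE, length(x)))
}

ind <- ends_with_any(df$sampleid, matches)
print(df[ind, ])
```

With both matches this keeps rows 2 and 3; with matches reduced to "632_CSF" it keeps only row 3.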

score 1 · Answer 3 · answered Feb 03 '18 at 23:12

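I suggest making your dataset more regular: split sampleid into separate columns, then filter on those columns instead of matching on the raw string.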

library(tidyverse)

df_regular <- df %>%
  separate(
    sampleid,
    into = c("patient_type",
             "test_number",
             "patient_group",
             "patient_id"),
    extra = "merge"
  ) %>%
  mutate(patient_id = str_pad(patient_id, 9, side = "left", pad = "0"))

df_regular

df_regular %>%
  filter(patient_group %in% "2632" & patient_id %in% "000000CSF")
