I have a data frame like the following:

sampleid <- c("patient_sdlkfjd_2354_CSF_CD19+", "control_sdlkfjd_2632_CSF_CD8+", "control_sdlkfjd_2632_CSF")
values <- rnorm(3, 8, 3)

df <- data.frame(sampleid, values)

I also have a vector like the following:

matches <- c("632_CSF_CD8+", "632_CSF")

I want to extract the rows of this data frame whose sampleid value ends with one of the matches. This example shows why the end of the string is important: two samples contain "632_CSF", but they are distinct samples. If I changed matches to only:

matches <- c("632_CSF")

then I want only the third row of the data frame to be returned, because it is the only one where the match occurs at the end of the sampleid.

How can this be achieved?

Thanks!

Keshav M

3 Answers

Just use $ in your pattern to indicate that it occurs at the end of the string.

grep("632_CSF$", sampleid, value=TRUE)
[1] "control_sdlkfjd_2632_CSF"
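To subset the data frame itself rather than the vector, the same idea can be sketched as follows. This is a minimal base R example, assuming the question's data; `single`, `keep`, and `multi` are illustrative names, and `endsWith()` (base R 3.3.0 or later) is swapped in for the multi-suffix case because it compares literal text and so sidesteps regex escaping.

```r
# Rebuild the question's data; values are random, so only sampleid matters here.
sampleid <- c("patient_sdlkfjd_2354_CSF_CD19+",
              "control_sdlkfjd_2632_CSF_CD8+",
              "control_sdlkfjd_2632_CSF")
df <- data.frame(sampleid, values = rnorm(3, 8, 3), stringsAsFactors = FALSE)

# Single pattern: grepl() returns a logical vector suitable for row indexing.
single <- df[grepl("632_CSF$", df$sampleid), ]

# Several literal suffixes: endsWith() does no regex interpretation,
# so the "+" in "632_CSF_CD8+" needs no escaping.
matches <- c("632_CSF_CD8+", "632_CSF")
keep <- Reduce(`|`, lapply(matches, function(m) endsWith(df$sampleid, m)))
multi <- df[keep, ]
```

Here `single` holds only the third row, while `multi` holds the second and third rows.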
G5W

You can do this with stringr and a little string manipulation.

First, escape any regex metacharacters in each match (such as the + in CD8+); the quotemeta function below does this.

Next, append $ to each escaped match so that it only matches at the end of the string, then join all the patterns into one with the regex OR operator, |.

Finally, pass the combined pattern to str_detect to get a logical index.

library(stringr)

# taken from here
# https://stackoverflow.com/a/14838753/1030110
quotemeta <- function(string) {
  str_replace_all(string, "(\\W)", "\\\\\\1")
}

matches_with_end <- sapply(matches, function(x) { paste0(quotemeta(x), '$') })
joined_matches <- paste(matches_with_end, collapse = '|')

ind <- str_detect(df$sampleid, joined_matches)
# [1] FALSE  TRUE  TRUE

df[ind, ]
#                        sampleid    values
# 2 control_sdlkfjd_2632_CSF_CD8+ 10.712634
# 3      control_sdlkfjd_2632_CSF  7.001628
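To see why the escaping step matters, here is a small base R sketch; `x`, `quote_base`, and `pattern` are hypothetical names used only for illustration, and `gsub()` stands in for `str_replace_all()`. Without escaping, the `+` is read as a quantifier on the preceding `8`, so the pattern fails to match a string that literally ends in `+`.

```r
x <- "control_sdlkfjd_2632_CSF_CD8+"

# Unescaped: "8+" means "one or more 8s", which must then be followed by the
# end of the string. The literal "+" in x gets in the way, so there is no match.
unescaped <- grepl("632_CSF_CD8+$", x)

# Escaped: "\\+" matches a literal plus sign.
escaped <- grepl("632_CSF_CD8\\+$", x)

# Base R equivalent of quotemeta: backslash-escape every non-word character.
quote_base <- function(s) gsub("(\\W)", "\\\\\\1", s)
pattern <- paste0(quote_base("632_CSF_CD8+"), "$")
```

With this data, `unescaped` is FALSE, and both `escaped` and `grepl(pattern, x)` are TRUE.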
m0nhawk

I suggest making your dataset more regular by splitting sampleid into separate columns, then filtering on those columns.

library(tidyverse)

df_regular <- df %>%
  separate(
    sampleid,
    into = c("patient_type",
             "test_number",
             "patient_group",
             "patient_id"),
    extra = "merge") %>%
  mutate(patient_id = str_pad(patient_id, 9, side = "left", pad = "0"))

df_regular

df_regular %>%
  filter(patient_group %in% "2632" & patient_id %in% "000000CSF")
Nettle