Subsetting a data frame containing factors, NA values, and wildcards

Question

I have a large data frame with several different categories; a simplified example is below. (The true dataset has 10+ different tissues, 15+ unique cell types per tissue with variable-length names, and thousands of genes.) The Tissue columns are formatted as factors.

GENENAME    Tissue1     Tissue2     Tissue3
Gene1       CellType_AA CellType_BB CellType_G
Gene2       CellType_AA CellType_BB       <NA>
Gene3       CellType_AA       <NA>        <NA>
Gene4       CellType_AA CellType_BB CellType_G
Gene5             <NA>        <NA>  CellType_G
Gene6             <NA>  CellType_BB CellType_H
Gene7       CellType_AC CellType_BD CellType_H
Gene8             <NA>        <NA>  CellType_H
Gene9       CellType_AC CellType_BD       <NA>
Gene10            <NA>  CellType_BB       <NA>
Gene11            <NA>  CellType_BD CellType_H
Gene12      CellType_AC       <NA>        <NA>
Gene13            <NA>  CellType_E  CellType_I
Gene14      CellType_F  CellType_E  CellType_I
Gene15      CellType_F  CellType_E        <NA>

What I am trying to do is return a subset based on CellTypes present in multiple tissues, and ignore unnecessary columns when I do so. Additionally, I want to use wildcards (in the example below, CellType_A*, in order to pick up both CellType_AA and CellType_AB), and ignore the other columns when I only specify some of the columns. I want the function to be easily reusable for different combinations of cell types, so I added a separate variable for each column.

To do this I set up the function below, setting the default value of each variable as "*", thinking that then it would treat any of those columns as valid if I don't specify an input.

Find_CoEnrich <- function(T1="*", T2="*", T3="*"){
  subset(dataset, 
         grepl(T1, dataset$Tissue1)
         &grepl(T2, dataset$Tissue2)
         &grepl(T3, dataset$Tissue3)
         ,select = GENENAME
  )  
}

However when I run the function on only a single column, to test it

Find_CoEnrich(T1="CellType_AA")

It will return only the following:

   GENENAME
1     Gene1
4     Gene4

instead of

1     Gene1
2     Gene2
3     Gene3
4     Gene4

In other words, it skips any row that contains an NA in another column. Even more mysteriously, if I try with the wildcard, it seemingly ignores the rest of the string and returns only those rows that have values in every column, even if they don't match the rest of the string, such as Gene14:

Find_CoEnrich(T1="CellType_A*")

   GENENAME
1     Gene1
4     Gene4
7     Gene7
14   Gene14

I am pretty sure it is the presence of the NAs in the table that is causing problems, but I have spent a long time trying to correct this and am running out of patience. If anyone can help, it would be much appreciated.

Is `c"*"` supposed to be `c("*")`? Please make sure you've tested your code before posting it in a question, it can be frustrating parsing through syntax errors caused by simple typos in the question, and not always clear that they aren't also errors in your real code. — r2evans, Dec 13 '21 at 15:50
It was a copy-paste error between versions when I was copying over the example data, sorry, fixed now. — Phil D, Dec 13 '21 at 15:53
It only returns those rows because the others have missing values (`NA`s)! — jsavn, Dec 13 '21 at 15:54
Yes I know, I want to know how to tell the code to only focus on the columns I specify. I thought setting the default variable to the wildcard `*` would make it accept anything in those columns, and would therefore only subset on the variables I specify, but I don't know how to make the wildcard apply to `NA` as well — Phil D, Dec 13 '21 at 15:56
If you expect genes 2 and 3, then that suggests that having `NA` in those fields should allow a match. With that logic, though, that means genes 5, 6, 8, 10, 11, and 13 should also match. I think you need to consider and/or better-communicate how `NA` values should be considered in your logic. — r2evans, Dec 13 '21 at 15:57
BTW, `*` by itself is not truly a valid regex for `grepl`, generally `*` (and `+` and similar "counting" indicators) need to follow something. I suggest you skim through https://stackoverflow.com/a/22944075/3358272 if you really want to use regex patterns. — r2evans, Dec 13 '21 at 15:58
Thanks, I will look through that, although it is a long article; I am sure I have probably misunderstood wildcards somewhere. And I only want `NA` to allow a match if the `NA` is in a non-specified column. So in the example I posted, I want it to return all instances of `CellType_AA` in the `Tissue1` column, no matter what is in the `Tissue2` or `Tissue3` columns. Genes 5, 6, 8, 10, 11 and 13 should not be returned, as they have `NA` in the `Tissue1` column, not `CellType_AA`. — Phil D, Dec 13 '21 at 16:11

jsavn · Accepted Answer · 2021-12-15T13:10:39.930

The wildcard character * you intend to use has a specific meaning in a regular expression, which is how you tell grepl which values to accept: it means zero or more repetitions of the preceding character, not "anything". Also, I believe you want a boolean OR (|) operation between the grepl expressions, since you want any row where one of the columns matches the pattern.
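As a quick check of the point above, here is a minimal base R sketch. The vector `x` is made up for illustration and is not part of the question's data. It shows that `CellType_A*` (that is, `CellType_` followed by zero or more `A`s) also matches `CellType_F`, and that `grepl()` returns `FALSE` for `NA`, which is why ANDing one `grepl()` per column drops every row with a missing value in any column. If shell-style wildcards are what you want, `glob2rx()` from `utils` is one way to convert them.

```r
# Made-up values for illustration only
x <- c("CellType_AA", "CellType_AC", "CellType_F", NA)

# "A*" means "zero or more A", so anything containing "CellType_" matches
star_match <- grepl("CellType_A*", x)

# Dropping the "*" already behaves like "contains CellType_A"
prefix_match <- grepl("CellType_A", x)

# glob2rx() turns a shell-style wildcard into a regular expression
glob_match <- grepl(glob2rx("CellType_A*"), x)

# NA input never matches: the last row is FALSE in every column
print(data.frame(x, star_match, prefix_match, glob_match))
```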

Here's a perhaps simpler solution using tidyverse, using separate 'row-based filtering' and 'column selection' steps:

library(tidyverse)

dataset <-  # small subset of your data, rows 1-4 should match but not 5
  tribble(
    ~GENENAME,    ~Tissue1,     ~Tissue2,     ~Tissue3,
    "Gene1", "CellType_AA", "CellType_BB", "CellType_G",
    "Gene2", "CellType_AA", "CellType_BB", NA,
    "Gene3", "CellType_AA", NA, NA,
    "Gene4", "CellType_AA", "CellType_BB", "CellType_G",
    "Gene5", NA, NA, "CellType_G"
    )

desired_pattern <- "CellType_A"  # note that this already implies that any other character can follow, e.g. this will match CellType_AA, CellType_AB, etc.

dataset %>%
  select(all_of(c("GENENAME","Tissue1","Tissue2","Tissue3"))) %>%  # the column selection
  filter(if_any(  # this is a tad confusing: return the row if any of the specified columns matches the condition...
    .cols = all_of(c("Tissue1", "Tissue2", "Tissue3")),  # specify which columns to check
    .fns = ~ stringr::str_detect(.x, pattern = desired_pattern)  # specify the condition...str_detect() plays the same role here as grepl()
  ))

To match cell types beginning with A or B instead, you could change the pattern accordingly:

desired_pattern  <- "CellType_[AB]"  # this will match any cell type that starts with A or B
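One way to express "starts with A or B" in a regular expression is a character class. The base R sketch below uses a made-up vector `cell_types` (an assumption for illustration) to show a character class and the equivalent alternation with `|` side by side.

```r
# Made-up values for illustration only
cell_types <- c("CellType_AA", "CellType_BD", "CellType_G", NA)

# [AB] matches exactly one character that is either A or B
ab_class <- grepl("CellType_[AB]", cell_types)

# Alternation with "|" expresses the same condition more verbosely
ab_alternation <- grepl("CellType_A|CellType_B", cell_types)

print(data.frame(cell_types, ab_class, ab_alternation))
```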

EDIT:

To find rows that match BOTH CellType_A in one of the columns and CellType_B in another, you can do two successive filter steps:

dataset %>%
  select(all_of(c("GENENAME","Tissue1","Tissue2","Tissue3"))) %>%  # the column selection
  filter(if_any(  # in this step, keep only rows that contain at least one `CellType_A`
    .cols = all_of(c("Tissue1", "Tissue2", "Tissue3")),  # specify which columns to check
    .fns = ~ stringr::str_detect(.x, pattern = "CellType_A")
  )) %>%
  filter(if_any(  # in this step, keep only rows that contain at least one `CellType_B`
    .cols = all_of(c("Tissue1", "Tissue2", "Tissue3")),  # specify which columns to check
    .fns = ~ stringr::str_detect(.x, pattern = "CellType_B")
  ))

The order of the two filtering steps above doesn't matter (and you can try swapping them round to convince yourself!)
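If you would rather keep the per-column arguments from the question, a base R sketch along the following lines may work. This is an assumption about the intended behaviour, not a tested drop-in replacement: the name `find_coenrich`, the `data` argument, and the `NULL` defaults are hypothetical. The idea is that an unspecified column is skipped entirely rather than matched against `"*"`, so `NA` values in that column cannot exclude a row.

```r
# A small subset of the question's example data, with factor Tissue columns
dataset <- data.frame(
  GENENAME = c("Gene1", "Gene2", "Gene3", "Gene4", "Gene5", "Gene7", "Gene14"),
  Tissue1 = factor(c("CellType_AA", "CellType_AA", "CellType_AA", "CellType_AA",
                     NA, "CellType_AC", "CellType_F")),
  Tissue2 = factor(c("CellType_BB", "CellType_BB", NA, "CellType_BB",
                     NA, "CellType_BD", "CellType_E")),
  Tissue3 = factor(c("CellType_G", NA, NA, "CellType_G",
                     "CellType_G", "CellType_H", "CellType_I")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: NULL means "this column was not specified"
find_coenrich <- function(data, T1 = NULL, T2 = NULL, T3 = NULL) {
  patterns <- list(Tissue1 = T1, Tissue2 = T2, Tissue3 = T3)
  keep <- rep(TRUE, nrow(data))
  for (col in names(patterns)) {
    pattern <- patterns[[col]]
    if (is.null(pattern)) next  # skip unspecified columns, NAs included
    # grepl() returns FALSE for NA, so NA never matches a specified pattern
    keep <- keep & grepl(pattern, as.character(data[[col]]))
  }
  data[keep, "GENENAME", drop = FALSE]
}

print(find_coenrich(dataset, T1 = "CellType_AA"))
print(find_coenrich(dataset, T1 = "CellType_A", T2 = "CellType_BB"))
```

With this subset, the first call keeps Gene1 to Gene4 regardless of the `NA`s in `Tissue2` and `Tissue3`, and the second call keeps only the rows that match both specified columns.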

Thanks, this seems to work! How would I modify the pattern if I wanted to return only those rows with, say, `CellType_AA` and `CellType_BB`? Also, I did try using the `|` (OR) separator when working things out myself, but I kept getting an error saying `‘|’ not meaningful for factors` — Phil D, Dec 13 '21 at 16:24
I've added a bit about including multiple matching types, e.g. A or B; as for the question of factors, that's a bit trickier - you'll want to convert the factor to its character value first, for example by including `as.character()` inside `grepl` like so: `grepl(T1, as.character(dataset$Tissue1)) | grepl(T2, as.character(dataset$Tissue2))` — jsavn, Dec 13 '21 at 16:29
Thanks this helps a lot and works. One thing though, the desired pattern is to identify rows that have CellType_A AND CellType_B, not OR — Phil D, Dec 14 '21 at 12:13
Ah, I see, I was focused on getting the logic of the same condition across multiple columns right; in this case, I would do two steps, 'filtering' for CellType_A first, and CellType_B second (or vice versa, the order doesn't matter) - this way you are left with rows that contain at least one of each — jsavn, Dec 15 '21 at 13:07
