I have a large data frame with several different categories; a simplified example is below. (The true dataset has 10+ different Tissues, 15+ unique celltypes per tissue with names of variable length, and thousands of genes.) The Tissue columns are formatted as factors.
GENENAME Tissue1 Tissue2 Tissue3
Gene1 CellType_AA CellType_BB CellType_G
Gene2 CellType_AA CellType_BB <NA>
Gene3 CellType_AA <NA> <NA>
Gene4 CellType_AA CellType_BB CellType_G
Gene5 <NA> <NA> CellType_G
Gene6 <NA> CellType_BB CellType_H
Gene7 CellType_AC CellType_BD CellType_H
Gene8 <NA> <NA> CellType_H
Gene9 CellType_AC CellType_BD <NA>
Gene10 <NA> CellType_BB <NA>
Gene11 <NA> CellType_BD CellType_H
Gene12 CellType_AC <NA> <NA>
Gene13 <NA> CellType_E CellType_I
Gene14 CellType_F CellType_E CellType_I
Gene15 CellType_F CellType_E <NA>
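For reproducibility, the example above can be rebuilt with something like the following. This is only a sketch of the simplified table: the object name dataset matches the function further down, and converting the Tissue columns with factor() is my assumption about how the real data is stored.

```r
# Rebuild the simplified example table; NA marks a gene with no celltype in that tissue
dataset <- data.frame(
  GENENAME = paste0("Gene", 1:15),
  Tissue1 = c("CellType_AA", "CellType_AA", "CellType_AA", "CellType_AA", NA,
              NA, "CellType_AC", NA, "CellType_AC", NA,
              NA, "CellType_AC", NA, "CellType_F", "CellType_F"),
  Tissue2 = c("CellType_BB", "CellType_BB", NA, "CellType_BB", NA,
              "CellType_BB", "CellType_BD", NA, "CellType_BD", "CellType_BB",
              "CellType_BD", NA, "CellType_E", "CellType_E", "CellType_E"),
  Tissue3 = c("CellType_G", NA, NA, "CellType_G", "CellType_G",
              "CellType_H", "CellType_H", "CellType_H", NA, NA,
              "CellType_H", NA, "CellType_I", "CellType_I", NA),
  stringsAsFactors = FALSE
)

# The Tissue columns are factors in the real data, so convert them here too
dataset[-1] <- lapply(dataset[-1], factor)
```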
What I am trying to do is return a subset of genes based on the CellTypes present in multiple tissues, ignoring any columns I don't specify. Additionally, I want to use wildcards (in the example below, CellType_A*, in order to pick up both CellType_AA and CellType_AC). I want the function to be easily reusable for different combinations of celltypes, so I added a separate argument for each column.
To do this I set up the function below, setting the default value of each argument to "*", thinking that it would then treat any of those columns as valid if I don't specify an input.
Find_CoEnrich <- function(T1 = "*", T2 = "*", T3 = "*") {
  subset(dataset,
         grepl(T1, dataset$Tissue1) &
           grepl(T2, dataset$Tissue2) &
           grepl(T3, dataset$Tissue3),
         select = GENENAME)
}
However, when I test the function on only a single column:
Find_CoEnrich(T1="CellType_AA")
It will return only the following:
GENENAME
1 Gene1
4 Gene4
instead of
1 Gene1
2 Gene2
3 Gene3
4 Gene4
Skipping any rows which contain an NA in another column. Even more mysteriously, if I try with the wildcard, it seemingly ignores the rest of the string and returns only those rows which have values in every column, even if they don't match the rest of the string, such as Gene14:
Find_CoEnrich(T1="CellType_A*")
GENENAME
1 Gene1
4 Gene4
7 Gene7
14 Gene14
I am pretty sure it is the presence of the NAs in the table that is causing the problems, but I have spent a long time trying to correct this and am running out of patience. If anyone can help, it would be much appreciated.
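For what it's worth, a few small checks seem to support the NA suspicion, and they also suggest that "*" is being read as a regular expression, not as a wildcard. The sketch below shows those checks and one direction I have been considering: use NULL as the "not specified" default, and convert the wildcard with glob2rx() from utils. The names find_coenrich_sketch and example_data are made up for illustration, and I am not sure this is the idiomatic way to do it.

```r
# grepl() reports FALSE for a missing value, whatever the pattern is
grepl("CellType_AA", NA)            # FALSE

# As a regular expression, "A*" means "zero or more A", so the pattern
# still matches a celltype that has no "A" after the underscore
grepl("CellType_A*", "CellType_F")  # TRUE

# glob2rx() turns a shell-style wildcard into an anchored regex
glob2rx("CellType_A*")              # "^CellType_A"

# Hypothetical variant: NULL means "do not filter on this column"
find_coenrich_sketch <- function(data, T1 = NULL, T2 = NULL, T3 = NULL) {
  patterns <- list(Tissue1 = T1, Tissue2 = T2, Tissue3 = T3)
  keep <- rep(TRUE, nrow(data))
  for (col in names(patterns)) {
    pattern <- patterns[[col]]
    if (!is.null(pattern)) {
      # Only columns that were actually specified can exclude a row
      keep <- keep & grepl(glob2rx(pattern), as.character(data[[col]]))
    }
  }
  data[keep, "GENENAME", drop = FALSE]
}

# Small made-up subset of the example table, just for this check
example_data <- data.frame(
  GENENAME = c("Gene1", "Gene2", "Gene3", "Gene7", "Gene14"),
  Tissue1 = factor(c("CellType_AA", "CellType_AA", "CellType_AA", "CellType_AC", "CellType_F")),
  Tissue2 = factor(c("CellType_BB", "CellType_BB", NA, "CellType_BD", "CellType_E")),
  Tissue3 = factor(c("CellType_G", NA, NA, "CellType_H", "CellType_I"))
)

exact_hits    <- find_coenrich_sketch(example_data, T1 = "CellType_AA")
wildcard_hits <- find_coenrich_sketch(example_data, T1 = "CellType_A*")
combined_hits <- find_coenrich_sketch(example_data, T1 = "CellType_A*", T2 = "CellType_BB")
```

On this small subset, exact_hits keeps Gene1 to Gene3 even though Gene2 and Gene3 have NAs in other columns, and wildcard_hits adds Gene7 but not Gene14. Note that glob2rx() anchors the pattern, so "CellType_AA" only matches that exact name.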