You really shouldn't store multiple words in the same element. Make vectors like this:
genes <- c("gene1","gene2","gene3","gene4","gene5")
Anyway, assuming that you work with a data frame called df and that the entries in your fourth column are indeed single strings in which the genes are separated by commas:
lis <- strsplit(df[,4], ",")
This will give you a list instead of a data frame, where every element is a character vector holding the genes of one row separately. Next, make a vector of the genes you are interested in (like above). Finally, do:
tab <- sapply(lis,function(x) any(genes %in% x))
Basically, for each row, %in% checks for each of the genes whether it is in there. Next, any returns TRUE if any of those comparisons returns TRUE. So, if any of the genes is found in x, the function returns TRUE for that row.
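To make the per-row logic more concrete, here is a small sketch of what happens inside the function for a single row. The string "gene3,gene6" and the name hits are only illustrative choices for this sketch.

```r
genes <- c("gene1", "gene2", "gene3", "gene4", "gene5")

# x stands in for one element of lis, i.e. one row after splitting
x <- strsplit("gene3,gene6", ",")[[1]]
x
# [1] "gene3" "gene6"

# %in% returns one logical value per entry of genes
hits <- genes %in% x
hits
# [1] FALSE FALSE  TRUE FALSE FALSE

# any() collapses those values into a single TRUE/FALSE for the row
any(hits)
# [1] TRUE
```

sapply simply repeats this for every element of the list and collects the results in one logical vector.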
For example:
df <- structure(list(col1 = 1:10, col2 = 1:10, col3 = 1:10, col4 = c("gene1,gene2,gene3",
"gene2,gene3", "gene6,gene8", "gene9,gene10", "gene1,gene2,gene10",
"gene5", "gene3,gene6", "gene1,gene2,gene8", "gene6,gene7", "gene1,gene4"
)), .Names = c("col1", "col2", "col3", "col4"), row.names = c(NA,
-10L), class = "data.frame")
genes <- c("gene1","gene2","gene3","gene4","gene5")
lis <- strsplit(df[,4], ",")
tab <- sapply(lis,function(x) any(genes %in% x))
tab
# [1] TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE
df
#    col1 col2 col3               col4
# 1     1    1    1  gene1,gene2,gene3
# 2     2    2    2        gene2,gene3
# 3     3    3    3        gene6,gene8
# 4     4    4    4       gene9,gene10
# 5     5    5    5 gene1,gene2,gene10
# 6     6    6    6              gene5
# 7     7    7    7        gene3,gene6
# 8     8    8    8  gene1,gene2,gene8
# 9     9    9    9        gene6,gene7
# 10   10   10   10        gene1,gene4
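If what you want in the end is the matching rows rather than the logical vector itself, you can probably just use the vector as a row index. A minimal sketch, assuming the same layout as above; small, sel and n_hits are names made up for this illustration, and small is a cut-down stand-in for df:

```r
# Cut-down stand-in for df (hypothetical data, same idea as the example)
small <- data.frame(col1 = 1:4,
                    col4 = c("gene1,gene2", "gene6,gene8", "gene5", "gene9"),
                    stringsAsFactors = FALSE)
genes <- c("gene1", "gene2", "gene3", "gene4", "gene5")

# Split each entry on commas, then flag rows containing any gene of interest
lis <- strsplit(small$col4, ",")
tab <- sapply(lis, function(x) any(genes %in% x))

sel <- small[tab, ]   # keep only the rows where tab is TRUE
n_hits <- sum(tab)    # TRUE counts as 1, so this is the number of matching rows
```

Note that strsplit needs a character vector; if your column was read in as a factor, wrapping it in as.character() first should take care of that.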
Edit: Adjusted script according to clearer description.