I have a problem in R. This is a snapshot of my real dataset:

[Image: snapshot of my dataset]

I want to create a variable that indicates whether at least one gene from a list of genes of interest is present in column D of my dataset (1 if present, 0 if not).

- An example of a list of genes that interest me: gene <- c("gene1|gene2|gene3|gene4")

Column D of my dataset lists the genes present in each individual: one row per individual, with that individual's genes separated by commas.

Which function can I use?

Jason Aller
Manel

1 Answer

You really shouldn't store multiple gene names in a single string. Put each one in its own element of a vector, like this:

genes <- c("gene1","gene2","gene3","gene4","gene5")

Anyway, assuming that you work with a data frame called df and that each entry in the fourth column is a single string in which the genes are separated by commas:

lis <- strsplit(df[,4], ",")

This gives us a list instead of a data frame, where each element is a character vector holding that row's genes separately. Next, make a vector of the genes you are interested in (like above). Finally, do:

tab <- sapply(lis,function(x) any(genes %in% x))

Basically, for each row, %in% checks each gene of interest to see whether it occurs in that row. The any function then returns TRUE if any of those comparisons is TRUE. So, if any of the genes is found in x, the result for that row is TRUE.
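To make the individual steps concrete, here is a minimal sketch for a single row. The names row_string, row_genes and hits are made up for this illustration and are not part of the script above.

```r
genes <- c("gene1", "gene2", "gene3", "gene4", "gene5")

# One entry of the gene column, as a single comma-separated string
row_string <- "gene3,gene6"

# strsplit() returns a list; [[1]] extracts the character vector for this row
row_genes <- strsplit(row_string, ",")[[1]]
row_genes
# [1] "gene3" "gene6"

# One TRUE/FALSE per gene of interest
hits <- genes %in% row_genes
hits
# [1] FALSE FALSE  TRUE FALSE FALSE

# TRUE as soon as at least one gene of interest is present
any(hits)
# [1] TRUE
```

The sapply call does exactly this for every element of lis and collects the results into one logical vector.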

For example:

df <- structure(list(col1 = 1:10, col2 = 1:10, col3 = 1:10, col4 = c("gene1,gene2,gene3", 
"gene2,gene3", "gene6,gene8", "gene9,gene10", "gene1,gene2,gene10", 
"gene5", "gene3,gene6", "gene1,gene2,gene8", "gene6,gene7", "gene1,gene4"
)), .Names = c("col1", "col2", "col3", "col4"), row.names = c(NA, 
-10L), class = "data.frame")

genes <- c("gene1","gene2","gene3","gene4","gene5")

lis <- strsplit(df[,4], ",")
tab <- sapply(lis,function(x) any(genes %in% x))
tab
# [1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE

df
#    col1 col2 col3               col4
# 1     1    1    1  gene1,gene2,gene3
# 2     2    2    2        gene2,gene3
# 3     3    3    3        gene6,gene8
# 4     4    4    4       gene9,gene10
# 5     5    5    5 gene1,gene2,gene10
# 6     6    6    6              gene5
# 7     7    7    7        gene3,gene6
# 8     8    8    8  gene1,gene2,gene8
# 9     9    9    9        gene6,gene7
# 10   10   10   10        gene1,gene4
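The question asks for a 1/0 variable stored in the dataset rather than a separate TRUE/FALSE vector. A rough sketch of that last step follows, under a few assumptions that are not in the original: the data frame dat and the column name has_gene are hypothetical, the gene column is assumed to be a factor (as it can be after read.csv with stringsAsFactors = TRUE), and some entries are assumed to have a space after the comma.

```r
# Hypothetical stand-in data; the gene column is deliberately a factor
dat <- data.frame(
  id = 1:4,
  col4 = factor(c("gene1,gene2", "gene6,gene8", "gene5", "gene9, gene3"))
)
genes <- c("gene1", "gene2", "gene3", "gene4", "gene5")

# strsplit() only accepts character input, so convert the factor first
lis <- strsplit(as.character(dat$col4), ",")

# trimws() drops stray spaces around gene names (e.g. "gene9, gene3")
tab <- sapply(lis, function(x) any(genes %in% trimws(x)))

# Store the result as 1/0 instead of TRUE/FALSE
dat$has_gene <- as.integer(tab)
dat$has_gene
# [1] 1 0 1 1
```

If the column is already of type character and contains no extra spaces, as.character and trimws change nothing, so keeping them in is harmless.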

Edit: Adjusted script according to clearer description.

slamballais
  • First, thank you for your quick answer. I tried your script, but when I use `lis <- strsplit(data1[,1], ",")` I get an error message saying that the argument is not a string. I just added a snapshot of my real dataset to my question; I hope it helps. As you can see in the snapshot, the variable of interest is named "RefSeqGenes" and the genes inside it are separated by commas. I should also specify that my list of genes of interest is quite long (200 genes), so I would like to create a single variable that is 1 if at least one of the 200 genes of interest is present and 0 if not. Thanks – Manel Jan 29 '16 at 04:51
  • I think I resolved the string error with `data1$RefSeqGenes <- as.character(data1$RefSeqGenes)`; I hope that is correct. I now get a new error message when using `colnames(tab) <- genes`: it says that the length of 'dimnames' [2] is not equal to the length of the table. I should also mention that some rows in my dataset contain up to 800 genes (separated by commas). Thanks – Manel Jan 29 '16 at 05:22
  • I updated the script so that if any of the genes is present it will be set to `TRUE`, otherwise `FALSE`. If you want `1` and `0`, just write `tab <- tab + 0`, which will change tab from a logical vector to a numerical vector while maintaining your numbering. I hope that it works now. If it does, don't forget to click the check mark next to my answer, so that the question will be closed. – slamballais Jan 29 '16 at 07:53
  • I checked some genes manually to be sure that the script works, and it does!!! Thanks a lot for the help. – Manel Feb 01 '16 at 02:01
  • I have another question, please. What if, instead of only one list of genes of interest (`list1 <- c("gene1","gene2","gene3")`), I had 15 lists of genes: `list1 <- c("gene1","gene2","gene3")`, `list2 <- c("gene1","gene5","gene3")`, `list3 <- c("gene7","gene2","gene5")`, `list4 <- c("gene1","gene2","gene3")`, ... and the condition is that at least one of those lists is completely present (all the genes in the list must be there, no fewer) in the same variable as before, in column 4? Thanks a lot – Manel Feb 02 '16 at 03:00
  • `lis <- strsplit(df[,4], ","); totallist <- list(list1,list2,list3); tab <- matrix(NA,nrow(df),length(totallist)); for (i in 1:length(totallist)) tab[,i] <- sapply(lis,function(x) all(totallist[[i]] %in% x))` – slamballais Feb 02 '16 at 10:40
  • Thanks a lot for the quick reply – Manel Feb 02 '16 at 19:26
  • One last question, please. Do you think it would be possible in R to select not only by gene but also by coordinates? As you can see in the snapshot I posted above, column 4 holds the start coordinate of the gene and column 5 the stop coordinate. I want to select, for example, rows that contain a gene X and also have coordinates starting at 100000 and ending at 200000, with an overlap of 80%. This means that if I have gene X and the coordinates are 100000-200000, 80000-180000, 120000-220000, ... the row is selected, so at least 80% of the region (100000-200000) is covered for gene X. – Manel Feb 02 '16 at 20:01
  • That's a completely different kind of question. You may want to Google around for that. For example, this sounds very similar: http://stackoverflow.com/questions/24766104/checking-if-value-in-vector-is-in-range-of-values-in-different-length-vector Also, check out `?findInterval` – slamballais Feb 02 '16 at 20:09
  • Thanks, I will post a new question because I could not find exactly how to do it – Manel Feb 02 '16 at 20:44
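For readers following the multi-list variant from the comments (at least one list must be completely present in a row), the one-line loop can be written out as the sketch below. It is only an illustration: the data frame dat and the names gene_sets, complete and any_complete are hypothetical, and the three lists are the ones given in the comments.

```r
# Hypothetical stand-in data with one comma-separated gene string per row
dat <- data.frame(
  col4 = c("gene1,gene2,gene3", "gene2,gene3", "gene5,gene7,gene2", "gene9"),
  stringsAsFactors = FALSE
)

# The gene lists of interest, collected in one named list
gene_sets <- list(
  list1 = c("gene1", "gene2", "gene3"),
  list2 = c("gene1", "gene5", "gene3"),
  list3 = c("gene7", "gene2", "gene5")
)

lis <- strsplit(dat$col4, ",")

# Rows = data rows, columns = gene lists;
# TRUE where every gene of that list is present in the row
complete <- sapply(gene_sets, function(set) {
  vapply(lis, function(x) all(set %in% x), logical(1))
})

# 1 if at least one list is completely present in the row, else 0
any_complete <- as.integer(rowSums(complete) > 0)
any_complete
# [1] 1 0 1 0
```

The only change from the single-list version is all in place of any inside the row check. Note that sapply returns a plain vector rather than a matrix when the data has a single row, so this sketch assumes at least two rows.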