
I have:

  1. A data frame test_var with ODB6_OG_ID, Gene_ID and some other columns.
  2. A data frame arranged_data with architecture_number, Details and Sequence. (Sequence data is not relevant)

I want to find out whether arranged_data$Details contains any value from test_var$ODB6_OG_ID, that is, whether the string in arranged_data$Details contains any of the values in test_var$ODB6_OG_ID as a substring. If it does, the test_var$Gene_ID of the respective ODB6_OG_ID should be appended to a file.

I have to do this for every architecture_number. There are around 18 architectures, and the full dataset has 5660 entries.

This is the code that I have written, but it is taking too long, which I assume is because of the large loops. (This is only for architecture 1, which contains 350+ sequences. I have to do the same for architectures 1-18.)

while (arranged_data$architecture_number==1)
 {
    if(grepl(arranged_data$V2,test_var$ODB6_OG_ID)==TRUE)
    {
        write.table(test_var$Gene_ID, file = "architecture1", append = TRUE, sep = '\n')
     }
}

My dataset looks like this. test_var:

ODB6_OG_ID  start   Gene_ID
EOG60024F   chrXR_group6    FBgn0247618
EOG60024H   chr4_group3 FBgn0070413
EOG60024K   chr2    FBgn0078093
EOG60024M   chr2    FBgn0243975
EOG60024V   chr4_group5 FBgn0247694
EOG60025C   chrXL_group1a   FBgn0247949
EOG60025F   chr3    FBgn0245234
EOG602XCD   chr4_group3 FBgn0080574
EOG602XCQ   chr4_group3 FBgn0078791

arranged_data contains:

architecture_number    Details
1    chr317678741767875EOG6HQF5814.8092+47
1    chr325176942517695EOG6NKCGX23.1869-87
1    chr391494069149407EOG6NZVDZ2.96183+105
1    chr246642624664263EOG6Z638J1.52323+138
1    chr4_group3231407231408EOG6QRHQP4.65431-721
1    chr311648221164823EOG6X3HNJ2.28484+96
1    chr333466933346694EOG66WZW582.1698+678
1    chrXR_group854636745463675EOG6XH0KP1.86172+57
1    chr283746518374652EOG6V17MG2.45409-68
1    chr31338293913382940EOG63XVQR1.60785+105

Required output: FBgn0247618 FBgn0070413 FBgn0078093 etc.

(These are not in order.)

Other information: OS: Ubuntu Xenial Xerus 16.04; R version: 3.3.0; RStudio version: 0.99.902

DarkRose
  • Please provide a [minimal reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Also, saving the data in a data frame and then writing to disk may be faster than writing to disk at each iteration. – Steve Bronder May 26 '16 at 13:17
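The suggestion in this comment (collect the results in memory and write them once, instead of appending to the file inside a loop) could be sketched roughly as follows. The toy data is adapted from the question: the second and third Details strings are hypothetical, altered so that they contain a known ID. The helper genes_for_architecture and the temporary output directory are likewise assumptions made for illustration.

```r
# Toy data adapted from the question. The 2nd and 3rd Details strings are
# hypothetical: they were altered to contain a known ODB6_OG_ID.
test_var <- data.frame(
  ODB6_OG_ID = c("EOG60024F", "EOG60024H", "EOG60024K"),
  Gene_ID    = c("FBgn0247618", "FBgn0070413", "FBgn0078093"),
  stringsAsFactors = FALSE
)
arranged_data <- data.frame(
  architecture_number = c(1, 1, 2),
  Details = c("chr317678741767875EOG6HQF5814.8092+47",
              "chr4_group3231407231408EOG60024H4.65431-721",
              "chr246642624664263EOG60024K1.52323+138"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: Gene_IDs whose ODB6_OG_ID occurs as a literal
# substring in at least one Details string of the given architecture.
genes_for_architecture <- function(arch) {
  details <- arranged_data$Details[arranged_data$architecture_number == arch]
  hit <- vapply(test_var$ODB6_OG_ID,
                function(id) any(grepl(id, details, fixed = TRUE)),
                logical(1))
  test_var$Gene_ID[hit]
}

# One write per architecture instead of one append per match.
out_dir <- tempdir()
for (arch in unique(arranged_data$architecture_number)) {
  genes <- genes_for_architecture(arch)
  writeLines(genes, file.path(out_dir, paste0("architecture", arch)))
}
```

Here grepl is called once per ID over the whole Details vector, so the only explicit loops run over the IDs and over the architectures, not over every row and ID pair.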

1 Answer


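Here is one idea, where df is test_var and df1 is arranged_data. It matches the two data frames on the chromosome name: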

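# Chromosome name in test_var: drop everything from the first underscore on
df$new <- gsub('_.*', '', df$start)
# Chromosome name in arranged_data: text before the underscore if there is one,
# otherwise the first four characters (e.g. "chr3")
df1$new <- ifelse(grepl('_', df1$Details), gsub('_.*', '', df1$Details), 
                                                   substring(df1$Details, 1, 4))

# Take the Gene_ID of the first row in df with the same chromosome name
df1$Gene_ID <- sapply(df1$new, function(i) df$Gene_ID[match(i, df$new)])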

df1
#   architecture_number                                       Details   new     Gene_ID
#1                    1         chr317678741767875EOG6HQF5814.8092+47  chr3 FBgn0245234
#2                    1         chr325176942517695EOG6NKCGX23.1869-87  chr3 FBgn0245234
#3                    1        chr391494069149407EOG6NZVDZ2.96183+105  chr3 FBgn0245234
#4                    1        chr246642624664263EOG6Z638J1.52323+138  chr2 FBgn0078093
#5                    1   chr4_group3231407231408EOG6QRHQP4.65431-721  chr4 FBgn0070413
#6                    1         chr311648221164823EOG6X3HNJ2.28484+96  chr3 FBgn0245234
#7                    1        chr333466933346694EOG66WZW582.1698+678  chr3 FBgn0245234
#8                    1 chrXR_group854636745463675EOG6XH0KP1.86172+57 chrXR FBgn0247618
#9                    1         chr283746518374652EOG6V17MG2.45409-68  chr2 FBgn0078093
#10                   1      chr31338293913382940EOG63XVQR1.60785+105  chr3 FBgn0245234
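If the match should instead be made on the ODB6_OG_ID itself, as the question describes, one possible sketch is to extract the ID from Details and look it up directly. This assumes every ID has the shape seen in the sample data ("EOG6" followed by five uppercase letters or digits), which the question does not confirm. The second and third Details strings below are hypothetical, altered so that they contain a known ID, and the names og_id and genes_by_arch are introduced only for illustration.

```r
# Toy data adapted from the question. The 2nd and 3rd Details strings are
# hypothetical: they were altered to contain a known ODB6_OG_ID.
test_var <- data.frame(
  ODB6_OG_ID = c("EOG60024F", "EOG60024H", "EOG60024K"),
  Gene_ID    = c("FBgn0247618", "FBgn0070413", "FBgn0078093"),
  stringsAsFactors = FALSE
)
arranged_data <- data.frame(
  architecture_number = c(1, 1, 2),
  Details = c("chr317678741767875EOG6HQF5814.8092+47",
              "chr4_group3231407231408EOG60024H4.65431-721",
              "chr246642624664263EOG60024K1.52323+138"),
  stringsAsFactors = FALSE
)

# Assumed ID shape: "EOG6" plus five uppercase letters or digits.
m <- regexpr("EOG6[A-Z0-9]{5}", arranged_data$Details)
og_id <- rep(NA_character_, nrow(arranged_data))
og_id[m > 0] <- regmatches(arranged_data$Details, m)

# Vectorised lookup; rows whose ID is not in test_var get NA.
arranged_data$Gene_ID <- test_var$Gene_ID[match(og_id, test_var$ODB6_OG_ID)]

# Unique, non-missing Gene_IDs for each architecture_number.
genes_by_arch <- lapply(
  split(arranged_data$Gene_ID, arranged_data$architecture_number),
  function(g) unique(g[!is.na(g)])
)
```

Each element of genes_by_arch could then be written to its own file with writeLines. If the IDs do not all follow the assumed pattern, a literal substring search with grepl(..., fixed = TRUE) per ID is the safer fallback.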
Sotos