I have:
- A data frame test_var with ODB6_OG_ID, Gene_ID and some other columns.
- A data frame arranged_data with architecture_number, Details and Sequence. (The Sequence data is not relevant here.)
I want to find out whether arranged_data$Details contains any value from test_var$ODB6_OG_ID. That is, whether the string in arranged_data$Details contains a substring that is one of the values in test_var$ODB6_OG_ID. If a match is found, the test_var$Gene_ID of the matching ODB6_OG_ID should be appended to a file.
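To make the substring test concrete, here is a minimal sketch of checking one ID against several Details strings at once. The two strings are taken from the arranged_data sample shown further down; the ID is a hypothetical value chosen only so that it occurs in the first string.

```r
# Two Details strings from the arranged_data sample in this post.
details <- c("chr317678741767875EOG6HQF5814.8092+47",
             "chr325176942517695EOG6NKCGX23.1869-87")

# Hypothetical ID, picked for illustration so that it occurs in the first string.
id <- "EOG6HQF58"

# fixed = TRUE treats the ID as a literal substring, not as a regular expression.
# grepl() is vectorised over its second argument, so no explicit loop is needed.
contains_id <- grepl(id, details, fixed = TRUE)
print(contains_id)  # TRUE for the first string, FALSE for the second
```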
I have to do this for every architecture_number. There are around 18 architectures, and the whole dataset has 5660 entries in total.
This is the code that I have written, but it is taking too long, which I assume is because of the large loops. (This is only for architecture 1, which contains 350+ sequences. I have to do the same for architectures 1 to 18.)
while (arranged_data$architecture_number==1)
{
    if(grepl(arranged_data$V2,test_var$ODB6_OG_ID)==TRUE)
    {
        write.table(test_var$Gene_ID, file = "architecture1", append = TRUE, sep = '\n')
    }
}
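A likely reason the code above never finishes is that the `while` condition compares a whole column and nothing inside the loop changes it, so the loop cannot end. In addition, `grepl` only uses the first element when its pattern is a vector. Below is a loop-free sketch for architecture 1. It assumes the ID column is ODB6_OG_ID and the text column is Details; the two small data frames and the temporary output path are invented for illustration only.

```r
# Toy stand-ins for the real data frames; all values below are invented.
test_var <- data.frame(
  ODB6_OG_ID = c("EOG6AAAAA", "EOG6BBBBB", "EOG6CCCCC"),
  Gene_ID    = c("FBgn_A", "FBgn_B", "FBgn_C"),
  stringsAsFactors = FALSE
)
arranged_data <- data.frame(
  architecture_number = c(1, 1, 2),
  Details = c("chr3111222EOG6AAAAA1.5+47",
              "chr2333444EOG6CCCCC2.1-8",
              "chr4555666EOG6BBBBB3.3+10"),
  stringsAsFactors = FALSE
)

# Details strings that belong to architecture 1 only.
details <- arranged_data$Details[arranged_data$architecture_number == 1]

# For each ID: does it occur as a literal substring in any of those strings?
hit <- vapply(as.character(test_var$ODB6_OG_ID),
              function(id) any(grepl(id, details, fixed = TRUE)),
              logical(1))

# Gene IDs of the matching rows, written one per line in a single call.
genes <- unique(test_var$Gene_ID[hit])
out_file <- file.path(tempdir(), "architecture1")  # temp dir used for the demo
writeLines(genes, out_file)
```

Writing the whole result with one `writeLines` call also avoids reopening the file once per match, which the `append = TRUE` pattern would do.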
My dataset looks like this.
test_var:
ODB6_OG_ID  start          Gene_ID
EOG60024F   chrXR_group6   FBgn0247618
EOG60024H   chr4_group3    FBgn0070413
EOG60024K   chr2           FBgn0078093
EOG60024M   chr2           FBgn0243975
EOG60024V   chr4_group5    FBgn0247694
EOG60025C   chrXL_group1a  FBgn0247949
EOG60025F   chr3           FBgn0245234
EOG602XCD   chr4_group3    FBgn0080574
EOG602XCQ   chr4_group3    FBgn0078791
arranged_data contains:
architecture_number  Details
1                    chr317678741767875EOG6HQF5814.8092+47
1                    chr325176942517695EOG6NKCGX23.1869-87
1                    chr391494069149407EOG6NZVDZ2.96183+105
1                    chr246642624664263EOG6Z638J1.52323+138
1                    chr4_group3231407231408EOG6QRHQP4.65431-721
1                    chr311648221164823EOG6X3HNJ2.28484+96
1                    chr333466933346694EOG66WZW582.1698+678
1                    chrXR_group854636745463675EOG6XH0KP1.86172+57
1                    chr283746518374652EOG6V17MG2.45409-68
1                    chr31338293913382940EOG63XVQR1.60785+105
Required output: FBgn0247618 FBgn0070413 FBgn0078093 etc.
(These are not in order.)
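Since the same lookup is needed for every architecture, one way to get the required output is to split the Details column by architecture_number and write one file per group. This is a sketch under assumptions, not a tested solution for the real data: `genes_by_architecture` is a hypothetical helper name, the data frames are invented toy values, and the files go to a temporary directory.

```r
# Hypothetical helper: returns a named list with one character vector of
# Gene_IDs per architecture_number.
genes_by_architecture <- function(arranged_data, test_var) {
  ids   <- as.character(test_var$ODB6_OG_ID)
  genes <- as.character(test_var$Gene_ID)
  # One vector of Details strings per architecture.
  details_by_arch <- split(as.character(arranged_data$Details),
                           arranged_data$architecture_number)
  lapply(details_by_arch, function(details) {
    # TRUE for every ID that is a literal substring of at least one string.
    hit <- vapply(ids,
                  function(id) any(grepl(id, details, fixed = TRUE)),
                  logical(1))
    unique(genes[hit])
  })
}

# Invented toy data, only to show the shape of the input and output.
test_var <- data.frame(
  ODB6_OG_ID = c("EOG6AAAAA", "EOG6BBBBB", "EOG6CCCCC"),
  Gene_ID    = c("FBgn_A", "FBgn_B", "FBgn_C"),
  stringsAsFactors = FALSE
)
arranged_data <- data.frame(
  architecture_number = c(1, 1, 2),
  Details = c("chr3111222EOG6AAAAA1.5+47",
              "chr2333444EOG6BBBBB2.1-8",
              "chr4555666EOG6AAAAA3.3+10"),
  stringsAsFactors = FALSE
)

result <- genes_by_architecture(arranged_data, test_var)

# One output file per architecture: architecture1, architecture2, ...
out_dir <- tempdir()
for (arch in names(result)) {
  writeLines(result[[arch]], file.path(out_dir, paste0("architecture", arch)))
}
```

With about 18 architectures and a few thousand rows, this performs one vectorised `grepl` call per ID and architecture instead of one call per row.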
Other information:
- OS: Ubuntu Xenial Xerus 16.04
- R version: 3.3.0
- RStudio version: 0.99.902