Extracting Gene Annotation IDs in R

Question

I have an annotation file and I want to parse out FlyBase transcript IDs to make a new column. I've tried regex, but it hasn't worked, and I'm not sure whether I'm using it correctly. The IDs are either at the beginning or in the middle of the string, which in this case is a collection of IDs from different databases. There might also be multiple FlyBase IDs, in which case I'd like to use a separator like ID1/ID2.

Example annotation lines: "AY113634 // --- // 100 // 2 // 2 // 0 /// FBtr0089787 // --- // 100 // 2 // 2 // 0"

"FBtr0079338 // --- // 100 // 15 // 15 // 0 /// FBtr0086326 // --- // 100 // 15 // 15 // 0 /// FBtr0100846 // --- // 100 // 15 // 15 // 0 /// NONDMET000145 // --- // 100 // 15 // 15 // 0 /// NONDMET000970 // --- // 100 // 15 // 15 // 0 /// NONDMET000971 // --- // 100 // 15 // 15 // 0"

I want to create a column that maintains the same order but only contains the FlyBase IDs with separators if necessary. I am working with the data.table package so if there's a solution using data tables that would be much appreciated. One idea I have is to use sub, search for [FBtr][0-9+] (not sure if that's right) and if it doesn't match that pattern then replace it with "".

Example Table: x <- data.table(probesetID = 1:10, probesetType = rep("main", 10), rep("FBtr0299871 // --- // 100 // FBtr193920 // 3 // 3 // 0", 10))

Could you please make a reproducible example: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — emilliman5, Oct 12 '17 at 17:15

emilliman5 · Answer 1 · 2017-10-12T18:18:16.307

Here is something to get you started; I can update the answer once I have a better idea of what your data.table looks like:

x <- "FBtr0079338 // --- // 100 // 15 // 15 // 0 /// FBtr0086326 // --- // 100 // 15 // 15 // 0 /// FBtr0100846 // --- // 100 // 15 // 15 // 0 /// NONDMET000145 // --- // 100 // 15 // 15 // 0 /// NONDMET000970 // --- // 100 // 15 // 15 // 0 /// NONDMET000971 // --- // 100 // 15 // 15 // 0"
sapply(strsplit(x, "/+"), function(s) grep("FBtr", trimws(s), value=TRUE))

#     [,1]         
#[1,] "FBtr0079338"
#[2,] "FBtr0086326"
#[3,] "FBtr0100846"

sapply(strsplit(x, "/+"), function(x) paste0(grep("FBtr", trimws(x), value=TRUE), collapse = ";"))
#[1] "FBtr0079338;FBtr0086326;FBtr0100846"

Edit:

To assign the result to a new column in the data.table:

x$FBtr <- sapply(strsplit(x$V3, "/+"), function(x) paste0(grep("FBtr", trimws(x), value=TRUE), collapse = ";"))

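In essence, you supply the column containing the annotations in place of x. Here x is the example data.table from the question, and V3 is the default name of its unnamed third column.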
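
For a regex-only route in base R, the following is a minimal sketch rather than part of the original answer; `extract_fbtr` is a hypothetical helper name. It also shows why the `[FBtr][0-9+]` pattern from the question falls short: square brackets define a character class, so that pattern matches a single one of the letters F, B, t, r followed by a single digit or plus sign, whereas `FBtr[0-9]+` matches the literal prefix followed by one or more digits.

```r
# Hypothetical helper (not from the original answer): pull every FlyBase
# transcript ID out of each annotation string and join them with a separator.
extract_fbtr <- function(annot, sep = "/") {
  # gregexpr() locates all matches of "FBtr" followed by digits;
  # regmatches() returns them as a list with one character vector per input.
  hits <- regmatches(annot, gregexpr("FBtr[0-9]+", annot))
  # Collapse each vector in order; inputs with no match become "".
  vapply(hits, paste, character(1), collapse = sep)
}

annot <- c(
  "AY113634 // --- // 100 // 2 // 2 // 0 /// FBtr0089787 // --- // 100 // 2 // 2 // 0",
  "FBtr0079338 // --- // 100 // 15 // 15 // 0 /// FBtr0086326 // --- // 100 // 15 // 15 // 0",
  "NONDMET000145 // --- // 100 // 15 // 15 // 0"
)
extract_fbtr(annot)
# [1] "FBtr0089787"             "FBtr0079338/FBtr0086326" ""
```

Because the helper is vectorized over its input, it should also work on a whole column, for example `x[, FBtr := extract_fbtr(V3)]` with the example table.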

This might be enough, but I posted an example table anyway. – abbas786 Oct 12 '17 at 18:12

score 0 · Accepted Answer · answered Oct 12 '17 at 18:35

More specific to data.table, and using the stringr package:

library(data.table)
library(stringr)
# Returns one row per probesetID, with its FBtr IDs joined by "/"
x[, .(IDs = str_c(unlist(str_extract_all(V3, "FBtr[0-9]+")), 
    collapse = "/")), by = probesetID]
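
The call above returns a new summary table. If the goal is to add the IDs as a column of the existing table, as the question asks, one possible variant is sketched below. It is an assumption-laden sketch, not the answerer's code: it rebuilds the example table from the question, relies on the unnamed third column being called V3, and uses `FBtr` as an illustrative column name.

```r
library(data.table)
library(stringr)

# Example table from the question; the unnamed third column defaults to V3.
x <- data.table(probesetID = 1:10, probesetType = rep("main", 10),
                rep("FBtr0299871 // --- // 100 // FBtr193920 // 3 // 3 // 0", 10))

# str_extract_all() returns a list with one character vector per row,
# so each element is collapsed on its own and no `by` is needed.
# `:=` adds the column to x by reference instead of building a new table.
x[, FBtr := vapply(str_extract_all(V3, "FBtr[0-9]+"),
                   paste, character(1), collapse = "/")]

x$FBtr[1]
# [1] "FBtr0299871/FBtr193920"
```

Collapsing per list element rather than grouping by probesetID keeps the result correct even if probesetID were not unique, and rows without any FlyBase ID end up as empty strings.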

answered Oct 12 '17 at 18:35 by David Klotz
