I have an annotation file and I want to parse out the FlyBase transcript IDs to make a new column. I've tried regex, but it hasn't worked, and I'm not sure whether I'm using it correctly. The IDs are either at the beginning or in the middle of the string, which in this case is a collection of IDs from different databases. There might also be multiple FlyBase IDs, in which case I'd like to use a separator like ID1/ID2.
Example annotation lines: "AY113634 // --- // 100 // 2 // 2 // 0 /// FBtr0089787 // --- // 100 // 2 // 2 // 0"
"FBtr0079338 // --- // 100 // 15 // 15 // 0 /// FBtr0086326 // --- // 100 // 15 // 15 // 0 /// FBtr0100846 // --- // 100 // 15 // 15 // 0 /// NONDMET000145 // --- // 100 // 15 // 15 // 0 /// NONDMET000970 // --- // 100 // 15 // 15 // 0 /// NONDMET000971 // --- // 100 // 15 // 15 // 0"
I want to create a column that maintains the same order but only contains the FlyBase IDs, with separators if necessary. I am working with the data.table package, so a solution using data tables would be much appreciated. One idea I have is to use sub, search for [FBtr][0-9+] (not sure if that's right), and replace anything that doesn't match that pattern with "".
Example Table:
x <- data.table(probesetID = 1:10, probesetType = rep("main", 10), rep("FBtr0299871 // --- // 100 // FBtr193920 // 3 // 3 // 0", 10))
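For reference, here is a minimal sketch of the kind of extraction I'm after, using the two example annotation lines from above (the second one shortened). Note that [FBtr][0-9+] is a pair of character classes that matches only two characters, so the sketch assumes the pattern FBtr[0-9]+ instead (the literal prefix followed by one or more digits). The column names annotation and flybaseID are made up for illustration, and returning NA for rows without a FlyBase ID is an assumption rather than a requirement.

```r
library(data.table)

# Hypothetical column name `annotation`; the strings are the example lines above
# (the second one is shortened).
dt <- data.table(
  probesetID = 1:2,
  annotation = c(
    "AY113634 // --- // 100 // 2 // 2 // 0 /// FBtr0089787 // --- // 100 // 2 // 2 // 0",
    "FBtr0079338 // --- // 100 // 15 // 15 // 0 /// FBtr0086326 // --- // 100 // 15 // 15 // 0 /// FBtr0100846 // --- // 100 // 15 // 15 // 0 /// NONDMET000145 // --- // 100 // 15 // 15 // 0"
  )
)

# gregexpr() locates every match of "FBtr" followed by digits; regmatches()
# returns them as a list with one character vector per row, in order of appearance.
matches <- regmatches(dt$annotation, gregexpr("FBtr[0-9]+", dt$annotation))

# Collapse each row's IDs with "/"; rows with no match get NA (an assumption).
dt[, flybaseID := vapply(
  matches,
  function(ids) if (length(ids)) paste(ids, collapse = "/") else NA_character_,
  character(1)
)]
```

With this approach the IDs are pulled out directly rather than deleting everything that does not match, which seems simpler than what I was attempting with sub.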