Parse text with separator depending on its structure

Question

My dataframe:

>datasetM
                                 Mean
ENSORLG00000001933:tex11     2500.706       
ENSORLG00000010797:         44225.330       
ENSORLG00000003008:pabpc1a  11788.555       
ENSORLG00000001973:sept6     3100.493      
ENSORLG00000000997:          5418.796

Output needed:

>out
[1] "tex11" "ENSORLG00000010797" "pabpc1a" "sept6" "ENSORLG00000000997"

I tried this, but I only retrieve the part before the separator:

titles <- rownames(datasetM)
vapply(strsplit(titles,":"), `[`, 1, FUN.VALUE=character(1))

Note: There is no logic in the alternation between ENS000:name and ENS00:

Note 2: The ENSOR identifiers are the row names.

Note 3: When there is nothing after ":", I want the ENSOR identifier.

so when there's nothing after : then you need the ENSOR... right? — amrrs, Oct 27 '17 at 13:34

acylam · Accepted Answer · 2017-10-27T13:47:50.267

3

Here is a solution with base R:

sapply(strsplit(rownames(df), ":"), function(x) x[length(x)])
# [1] "tex11"              "ENSORLG00000010797" "pabpc1a"            "sept6"             
# [5] "ENSORLG00000000997"
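
Taking the last piece works because strsplit does not return an empty string after a trailing separator, so a row name ending in ":" splits into a single element. A minimal sketch on two of the row names from the question (the names ids, parts and last_piece are only for illustration, and vapply is swapped in for sapply just to pin the return type):

```r
# Two row names from the question: one with a gene name, one without
ids <- c("ENSORLG00000001933:tex11", "ENSORLG00000010797:")

# Split on the literal colon; a trailing ":" produces no empty last element
parts <- strsplit(ids, ":", fixed = TRUE)
lengths(parts)
# [1] 2 1

# The last element is the gene name when present, otherwise the ID itself
last_piece <- vapply(parts, function(x) x[length(x)], character(1))
last_piece
# [1] "tex11"              "ENSORLG00000010797"
```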

Another solution with sub, which might be simpler:

sub("^\\w+:(?=\\w)|:", "", rownames(df), perl = TRUE)
# [1] "tex11"              "ENSORLG00000010797" "pabpc1a"            "sept6"             
# [5] "ENSORLG00000000997"
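
To see what each half of the pattern contributes, it can help to run the two alternatives separately. This is only an illustrative sketch on two of the row names; the variable names ids, prefix_only and colon_only are made up here:

```r
ids <- c("ENSORLG00000001933:tex11", "ENSORLG00000010797:")

# First alternative: remove "ID:" only when a word character follows the colon
prefix_only <- sub("^\\w+:(?=\\w)", "", ids, perl = TRUE)
prefix_only
# [1] "tex11"               "ENSORLG00000010797:"

# Second alternative: remove the first colon, wherever it is
colon_only <- sub(":", "", ids)
colon_only
# [1] "ENSORLG00000001933tex11" "ENSORLG00000010797"
```

Combined with |, the first alternative handles row names that have a gene name, and the second only gets a chance to match when nothing follows the colon.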

Data:

df = read.table(text = "                                 Mean
ENSORLG00000001933:tex11     2500.706       
ENSORLG00000010797:         44225.330       
ENSORLG00000003008:pabpc1a  11788.555       
ENSORLG00000001973:sept6     3100.493      
ENSORLG00000000997:          5418.796", header = TRUE, row.names = 1)

edited Oct 27 '17 at 13:47

answered Oct 27 '17 at 13:38


1

I tried an alternative ```sapply(strsplit(rownames(df),split = ":"),function(x){ifelse(length(x)==2,x[2],x[1])})``` – amrrs Oct 27 '17 at 13:40
@amrrs Sure that works too, but don't you think mine is simpler :) – acylam Oct 27 '17 at 13:41
Indeed, that's why I didn't add a separate answer :p – amrrs Oct 27 '17 at 13:46

score 2 · Answer 2 · answered Oct 27 '17 at 13:43

Here is a vectorized way to do this, using a regex (taken from here) to identify the last character of each row name:

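# TRUE for row names whose last character is ":"
ends_with_colon <- sub('.*(?=.$)', '', rownames(df), perl = TRUE) == ':'
# for all other row names, drop everything up to and including the colon
rownames(df)[!ends_with_colon] <- sub('.*:', '', rownames(df)[!ends_with_colon])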

which gives:

                           V2
tex11                2500.706
ENSORLG00000010797: 44225.330
pabpc1a             11788.555
sept6                3100.493
ENSORLG00000000997:  5418.796

DATA

dput(df)
structure(list(V2 = c(2500.706, 44225.33, 11788.555, 3100.493, 
5418.796)), .Names = "V2", row.names = c("tex11", "ENSORLG00000010797:", 
"pabpc1a", "sept6", "ENSORLG00000000997:"), class = "data.frame")

NOTE: You can then remove the remaining colons from the row names with:

rownames(df) <- sub(':', '', rownames(df))
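
Put together, the two steps of this answer can be sketched on a plain character vector as follows. Note the assumptions: rn and ends_in_colon are illustrative names, and base R's endsWith() (available since R 3.3.0) is swapped in for the regex test on the last character, so this is a variant rather than the answer's exact code:

```r
# Row names as given in the question
rn <- c("ENSORLG00000001933:tex11", "ENSORLG00000010797:",
        "ENSORLG00000003008:pabpc1a", "ENSORLG00000001973:sept6",
        "ENSORLG00000000997:")

# Step 1: where a gene name follows the colon, keep only the gene name
ends_in_colon <- endsWith(rn, ":")  # assumption: replaces the regex check
rn[!ends_in_colon] <- sub(".*:", "", rn[!ends_in_colon])

# Step 2: strip the trailing colon from the remaining IDs
rn <- sub(":", "", rn, fixed = TRUE)
rn
# [1] "tex11"              "ENSORLG00000010797" "pabpc1a"            "sept6"
# [5] "ENSORLG00000000997"
```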
