In R, how can I manipulate a variable in a data frame using a regular expression?

Question

This is the dataset

df1 <- data.frame("id" = c("ebi.ac.uk:MIAMExpress:Reporter:A-MEXP-503.100044", 
                       "ebi.ac.uk:MIAMExpress:Reporter:A-MEXP-783.100435",
                       "ebi.ac.uk:MIAMExpress:Reporter:C-DEA-783.100435"),
              "Name" = c("ABC", "DEF", ""))

The dataset prints as

                                                id Name
1 ebi.ac.uk:MIAMExpress:Reporter:A-MEXP-503.100044  ABC
2 ebi.ac.uk:MIAMExpress:Reporter:A-MEXP-783.100435  DEF
3  ebi.ac.uk:MIAMExpress:Reporter:C-DEA-783.100435

I want to make the data frame look like this

      id Name
1 100044  ABC
2 100435  DEF
3 100435   NA

Can anyone show me how to approach this problem?

In this case you should be able to just get the substring. For instance by using `substr` or maybe by the faster `strsplit` on `.`. — Bram Vanroy, Aug 27 '16 at 22:48
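A minimal sketch of the `substr` idea from the comment above. It assumes every id ends in exactly six digits, which holds for the sample data but is not guaranteed in general; `ids` is a hypothetical stand-in for `df1$id` after conversion to character.

```r
# Hypothetical stand-in for as.character(df1$id)
ids <- c("ebi.ac.uk:MIAMExpress:Reporter:A-MEXP-503.100044",
         "ebi.ac.uk:MIAMExpress:Reporter:A-MEXP-783.100435",
         "ebi.ac.uk:MIAMExpress:Reporter:C-DEA-783.100435")

# Assumption: the part of interest is always the last 6 characters
n <- nchar(ids)
last_six <- substr(ids, n - 5, n)
```

If the trailing part can vary in length, splitting on the dot is the safer choice.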

score 2 · Accepted Answer · edited May 23 '17 at 11:51


Regex way to find the position of the last dot:

df1$id <- as.character(df1$id)
regexpr("\\.[^\\.]*$", df1$id) # the \\ inside the brackets may not be needed

or

sapply(gregexpr("\\.", df1$id), tail, 1)
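Both calls above return the position of the last dot, not the digits after it. A minimal sketch of turning that position into the trailing number, where `ids` is a hypothetical stand-in for `df1$id` after conversion to character:

```r
# Hypothetical stand-in for as.character(df1$id)
ids <- c("ebi.ac.uk:MIAMExpress:Reporter:A-MEXP-503.100044",
         "ebi.ac.uk:MIAMExpress:Reporter:A-MEXP-783.100435",
         "ebi.ac.uk:MIAMExpress:Reporter:C-DEA-783.100435")

# Position of the last "." in each string
last_dot <- regexpr("\\.[^.]*$", ids)

# Keep everything after that position
from_pos <- substring(ids, last_dot + 1)

# Alternative: delete everything up to and including the last "."
# (the greedy ".*" runs to the final dot)
from_sub <- sub(".*[.]", "", ids)
```

The `sub` variant skips the intermediate position and may be easier to read when only the extracted text is needed.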

Easier to remember, non-regex way:

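# strsplit() needs character input, not a factor
df1$id <- as.character(df1$id)

# split each id on literal dots and keep the last piece
df1$id <- sapply(strsplit(df1$id, split = "\\."), tail, 1)
# treat empty names as missing
df1$Name[df1$Name == ""] <- NA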

df1

      id Name
1 100044  ABC
2 100435  DEF
3 100435 <NA>

sapply(strsplit(df1$id,split="\\."),tail,1) is from here.
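Note that `split = "\\."` is still interpreted as a regular expression. A minimal sketch of a variant that treats the dot literally via `fixed = TRUE` and uses `vapply` so the result is guaranteed to be a character vector; `ids` is a hypothetical stand-in for `df1$id` after conversion to character:

```r
# Hypothetical stand-in for as.character(df1$id)
ids <- c("ebi.ac.uk:MIAMExpress:Reporter:A-MEXP-503.100044",
         "ebi.ac.uk:MIAMExpress:Reporter:A-MEXP-783.100435",
         "ebi.ac.uk:MIAMExpress:Reporter:C-DEA-783.100435")

# fixed = TRUE splits on the literal "." with no regex involved
parts <- strsplit(ids, ".", fixed = TRUE)

# Take the last piece of each split; vapply checks each result is one string
last_piece <- vapply(parts, utils::tail, character(1), n = 1)
```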

edited May 23 '17 at 11:51 by Community · answered Aug 27 '16 at 22:50 by Hack-R


You do not need the double escapes before a period within a character-class sub-expression. (They are not "meta" within the character-class evaluation environment; see `?regex`.) I find it easier to specify a period character in a pattern by just enclosing it in square brackets, so this could just be `"[.][^.]*$"` – IRTFM Aug 28 '16 at 01:11
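A quick check of the point in this comment, as a sketch on shortened, hypothetical ids: the escaped pattern and the bracketed pattern should locate the same last dot.

```r
# Shortened, hypothetical ids in the same shape as the question's data
ids <- c("A-MEXP-503.100044", "C-DEA-783.100435")

escaped   <- regexpr("\\.[^\\.]*$", ids)  # pattern from the answer
bracketed <- regexpr("[.][^.]*$", ids)    # pattern suggested in the comment

# Compare the match positions only, ignoring attributes
same_positions <- identical(as.integer(escaped), as.integer(bracketed))
```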
