R - Regex to separate string based on first dot?

Question

I have a column which is filled with strings containing multiple dots. I want to split this column into two containing the two substrings before and after the first dot.

I.e.

comb          num
UWEA.n.49.sp   3
KYFZ.n.89.kr   5
     ...

Into

 a         b       num
UWEA    n.49.sp     3
KYFZ    n.89.kr     5
     ...

I'm using the separate function from tidyr but cannot get the regexp correct. I'm trying to use the regex style from this answer:

foo %>%
    separate(comb, into=c('a', 'b'),
             sep="([^.]+)\\.(.*)")

So column a should be determined by the first capture group ([^.]+), which contains at least one non-dot character, followed by the first dot; the second capture group (.*) then matches whatever remains.

However this doesn't seem to match anything:

a   b   num
         3
         5

Here's my dummy dataset:

library(dplyr)
library(tidyr)
foo <- data.frame(comb=replicate(10, 
                                 paste(paste(sample(LETTERS, 4), collapse=''),
                                       sample(c('p', 'n'), 1), 
                                       sample(1:100, 1), 
                                       paste(sample(letters, 2), collapse=''), 
                                       sep='.')
                                 ),
                  num = sample(1:10, 10, replace=TRUE))

Why use a regex when there's a built-in function to split strings? https://stat.ethz.ch/R-manual/R-devel/library/base/html/strsplit.html — , Dec 21 '16 at 20:59
Unless I'm missing something, that function still requires regex for the `split` argument, I'd then have to manually set the result to two different columns. Using tidyr's `separate` function is simpler. — Stuart Lacy, Dec 21 '16 at 21:01
`foo %>% separate(comb, into = c("a","b"), sep = "(?<=[A-Z])\\.(?=[a-z]+)")`. — Abdou, Dec 21 '16 at 21:10
That doesn't work @RichScriven, column a is the full string and b is . Possibly because that example requires Perl like regex which `separate` doesn't allow. @Abdou that works! If you write it up as an answer with some explanation of what the `?<=` and `?=` does I'll accept it. — Stuart Lacy, Dec 21 '16 at 21:28

score 10 · Answer 1 · answered Dec 21 '16 at 21:22

This is a case where you can take advantage of the extra = "merge" option in separate. Because separate splits on any run of non-alphanumeric characters by default (sep = "[^[:alnum:]]+"), you don't have to define the separator. If you wanted to be explicit, you could use sep = "\\."

foo %>%
    separate(comb, into=c('a', 'b'), extra = "merge")

      a       b num
1  NPTE p.10.ku   4
2  YAIU p.54.lw   4
3  CHUR n.51.kx   6
4  EPGX n.14.lg   3
5  POBJ n.11.ja   5
6  LEWI n.72.un   7
7  WLAP n.20.ve  10
8  XZUY p.75.cf   6
9  ZSNJ  p.4.aj   3
10 ABKR n.69.ua   3

extra = "merge" takes all the extra pieces beyond the columns you defined and merges them into the last column.
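To see the merge behaviour on a tiny input, here is a minimal sketch with the separator spelled out. The data frame toy and the result name res are made up for illustration; the default separator should give the same result for strings like these.

```r
library(tidyr)

# Hypothetical two-row data frame mirroring the layout in the question
toy <- data.frame(comb = c("UWEA.n.49.sp", "KYFZ.n.89.kr"),
                  num  = c(3, 5),
                  stringsAsFactors = FALSE)

# Split on literal dots; with only two target columns, everything after
# the first dot is merged into 'b' instead of being dropped
res <- separate(toy, comb, into = c("a", "b"), sep = "\\.", extra = "merge")
res
```

Without extra = "merge", the pieces beyond the second column would be discarded with a warning.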

Sorry I meant to say in the question I wanted a full regex answer as I'd like to improve my skills. I'd got it working this way but was frustrated at my inability to use what seemed to me to be a basic regex format. — Stuart Lacy, Dec 21 '16 at 21:29

score 4 · Accepted Answer · answered Dec 21 '16 at 21:42

I think @aosmith's answer is great and definitely less clunky than a regex solution involving lookarounds. But since you're intent on using regex, here it is:

foo %>% 
    separate(comb, 
             into = c("a","b"), 
             sep = "(?<=[A-Z])\\.(?=[a-z]+)")

The trick here is the regex itself. It uses what is known as lookaround: (?<=[A-Z]) is a lookbehind that requires an uppercase letter immediately before the dot, and (?=[a-z]+) is a lookahead that requires a lowercase letter immediately after it. Neither one consumes any characters, so only the dot itself is treated as the separator. In other words: match a dot preceded by a capital letter and followed by a lowercase letter (i.e. UWEA.n).

This allows the separate function to split the comb column on the dots that are between A and n or between Z and n, in your case.
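The lookaround behaviour can be checked in base R without tidyr. This is only a sketch: the vector x and the names pat, pos and marked are illustrative, and the + inside the lookahead is dropped because a lookahead only checks that one lowercase letter follows.

```r
x <- c("UWEA.n.49.sp", "KYFZ.n.89.kr")

# perl = TRUE enables lookbehind (?<=...) and lookahead (?=...)
pat <- "(?<=[A-Z])\\.(?=[a-z])"

# Positions of every match in each string: only the first dot qualifies,
# because the later dots are preceded by a lowercase letter or a digit
hits <- gregexpr(pat, x, perl = TRUE)
pos <- sapply(hits, function(m) as.integer(m))
pos

# Replacing the matches shows that the lookarounds consume nothing:
# the letters on both sides of the dot are left in place
marked <- gsub(pat, " | ", x, perl = TRUE)
marked
```

Note that this pattern depends on the data having an uppercase letter before the first dot and a lowercase letter after it; it is not a general "first dot" matcher.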

I hope this helps.

akrun · Answer 3 · 2016-12-22T01:54:24.117

Here is a base R option: replace the first "." with "," in the 'comb' column, read the result with read.csv to create two columns split on the "," delimiter, and cbind them with the other columns of 'foo'.

cbind(read.csv(text=sub("\\.", ",", foo$comb), 
          col.names = c('a', 'b'), header=FALSE), foo[-1])
#      a       b num
#1  GJMU n.83.cu   3
#2  IVMD p.85.ny   9
#3  HLQB p.94.rd   8
#4  WIJY n.92.sz   4
#5  QXCM n.38.lf   8
#6  UBNC n.82.js   5
#7  EPLZ n.56.kl   3
#8  YRBA  n.6.ny   8
#9  HQMR p.54.pn  10
#10 LBPO p.98.tv   7
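The approach above relies on sub replacing only the first match, whereas gsub replaces all of them. A small sketch of the difference, using an illustrative string x (the variable names are not from the original post):

```r
x <- "UWEA.n.49.sp"

# sub() replaces only the first match of the pattern
first_only <- sub("\\.", ",", x)
first_only

# gsub() replaces every match
all_dots <- gsub("\\.", ",", x)
all_dots

# fixed = TRUE treats "." as a literal string, so no escaping is needed
fixed_first <- sub(".", ",", x, fixed = TRUE)
fixed_first
```

One caveat to keep in mind: routing the strings through read.csv assumes they contain no commas or quote characters of their own.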

Another option is extract from tidyr, where we match one or more characters that are not a dot and place them in a capture group (([^.]+)), followed by a dot (\\.), followed by the remaining characters in a second capture group ((.*)). The captured groups are returned as two columns that replace the original 'comb' column.

library(tidyr)
extract(foo, comb, into = c("a", "b"), "([^.]+)\\.(.*)")
#      a       b num
#1  GJMU n.83.cu   3
#2  IVMD p.85.ny   9
#3  HLQB p.94.rd   8
#4  WIJY n.92.sz   4
#5  QXCM n.38.lf   8
#6  UBNC n.82.js   5
#7  EPLZ n.56.kl   3
#8  YRBA  n.6.ny   8
#9  HQMR p.54.pn  10
#10 LBPO p.98.tv   7
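This also suggests why the same pattern produced empty columns when passed as sep to separate: sep describes the delimiter, and this pattern matches the whole string, leaving nothing on either side. A base R sketch of both points, with illustrative names x, pat, whole and parts:

```r
x <- "UWEA.n.49.sp"
pat <- "([^.]+)\\.(.*)"

# The pattern matches the entire string, so used as a separator it
# leaves no text before or after the "delimiter"
whole <- regmatches(x, regexpr(pat, x))
whole == x

# regexec() exposes the capture groups:
# element 1 is the full match, then group 1, then group 2
parts <- regmatches(x, regexec(pat, x))[[1]]
parts
```

So extract is the function that uses capture groups to define the fields, while separate expects a pattern for what sits between them.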

NOTE: There was no set.seed in the OP's post
