I'm looking for a regular expression to catch all digits in the first 7 characters in a string.

This string has 12 characters:

A12B345CD678

I would like to remove A and B only since they are within the first 7 chars (A12B345) and get

12345CD678

So, the CD678 should not be touched. My current solution in R:

library(stringr)

s <- "A12B345CD678"
paste(paste(str_extract_all(substr(s, 1, 7), "[0-9]+")[[1]], collapse = ""), substr(s, 8, nchar(s)), sep = "")

It seems too complicated. I split the string after the 7th character, extract all digits from the first 7 characters, and paste them back together with the rest of the string.

I'm looking for a more general answer than splitting off the first 7 characters and matching the digits in that substring.

Any help appreciated.

Wiktor Stribiżew
Sebastian
  • Your question doesn't make any sense to me. `12345CD678` is more than 7 characters, even more than 7 digits - and what do you mean by `the first E-F`? – Tim Pietzcker Feb 15 '16 at 13:23
  • Your example is confusing. What is "first E-F 7 characters"? Why do you also match CD even though they are not digits? – neuhaus Feb 15 '16 at 13:26
  • Please post your current solution. – bobble bubble Feb 15 '16 at 13:30
  • To be more clear, A12B345CD678 has 12 characters. I want to match all digits in the first 7 characters, so CD678 would not be touched. My current solution in R: `paste(paste(str_extract_all(substr("A12B345CD678",1,7), "[0-9]+")[[1]],collapse=""),substr("A12B345CD678",8,nchar("A12B345CD678")),sep="")`, which seems to be too complicated. I split the string at 7 as described, match any digits in the first 7 characters and bind them with the rest of the string. – Sebastian Feb 15 '16 at 13:40
  • Seems like you need variable length lookbehind [like this demo](http://regexstorm.net/tester?p=(%3f%3c%3d%5e.%7b0%2c6%7d)%5cD&i=A12B345CD678&o=m): `(?<=^.{0,6})\D` which is not available in R regex flavor. [This could work](https://regex101.com/r/uS0oF0/1) with `perl=TRUE` but looks even worse to maintain. – bobble bubble Feb 15 '16 at 13:59
  • I see this is more complex than I thought, thank you for your efforts bobble bubble – Sebastian Feb 15 '16 at 14:02

3 Answers

You can use the well-known SKIP-FAIL regex trick: match and skip the whole rest of the string starting from the 8th character (located with a lookbehind), so that non-digit characters are only matched within the first 7:

s <- "A12B345CD678"
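gsub("(?<=.{7}).*$(*SKIP)(*F)|\\D", "", s, perl=TRUE)
## => [1] "12345CD678"

See IDEONE demo

The perl=TRUE argument is required for this regex to work, because lookbehinds and the (*SKIP)(*F) verbs are PCRE features that R's default regex engine does not support. The regex breakdown: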

  • (?<=.{7}).*$(*SKIP)(*F) - the lookbehind (?<=.{7}) requires 7 characters before the current position; .* then matches any characters but a newline (add (?s) at the beginning if the input contains newline symbols), as many as possible, up to the end of the string ($; \\z might also be required to handle a final newline). The (*SKIP)(*F) verbs make the engine discard the whole matched text and resume matching at the position where that text ends.
  • | - or...
  • \\D - a non-digit character.

See the regex demo.

Wiktor Stribiżew
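To see which characters the SKIP-FAIL pattern from the answer above actually matches, one can list the matches instead of replacing them. This is only a sketch using base R's gregexpr() and regmatches(); the variable names pattern, m, matched and starts are illustrative and not part of the original answer.

```r
s <- "A12B345CD678"
pattern <- "(?<=.{7}).*$(*SKIP)(*F)|\\D"

# Locate every match; perl = TRUE selects the PCRE engine
m <- gregexpr(pattern, s, perl = TRUE)

# Matched substrings and their 1-based start positions
matched <- regmatches(s, m)[[1]]
starts <- as.integer(m[[1]])

print(matched)
print(starts)
```

For the sample string this should report only A and B (at positions 1 and 4): everything from the 8th character on is consumed by the first alternative and then discarded by (*SKIP)(*F), so C and D are never offered to \\D.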

The regex solution is cool, but I'd use something easier to read for maintainability. E.g.

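library(stringr)

s <- "A12B345CD678"
# Strip uppercase letters from the first 7 characters only; use '\\D' to drop every non-digit
str_sub(s, 1, 7) <- gsub('[A-Z]', '', str_sub(s, 1, 7))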
eddi
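The same split-and-rejoin idea can also be written without stringr. The following is a minimal base R sketch, not part of the answer above: the helper name strip_prefix_nondigits and its parameter n are hypothetical, and it removes every non-digit (\\D) rather than only [A-Z].

```r
# Hypothetical helper: remove non-digits from the first n characters of x
strip_prefix_nondigits <- function(x, n = 7L) {
  head_part <- substr(x, 1L, n)       # first n characters (fewer if x is shorter)
  tail_part <- substring(x, n + 1L)   # everything after position n, "" if none
  paste0(gsub("\\D", "", head_part), tail_part)
}

# Works on a single string or on a character vector
single <- strip_prefix_nondigits("A12B345CD678")
several <- strip_prefix_nondigits(c("AB1", "1234567XYZ"))
print(single)
print(several)
```

Because substr(), substring(), gsub() and paste0() are all vectorized, the helper can be applied to a whole column of strings at once, and strings shorter than n are cleaned in full.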

You can also use a simple negative lookbehind:

s <- "A12B345CD678"
gsub("(?<!.{7})\\D", "", s, perl=TRUE)
## => [1] "12345CD678"
Casimir et Hippolyte
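Since the question asks for a general answer, the negative lookbehind from the answer above can be built for any prefix length. This is a sketch under that assumption; the helper name remove_nondigits_in_prefix and its parameter n are hypothetical and do not appear in the original answers.

```r
# Hypothetical helper: drop non-digits that are NOT preceded by n characters,
# i.e. non-digits located within the first n characters of each string
remove_nondigits_in_prefix <- function(x, n = 7L) {
  pattern <- sprintf("(?<!.{%d})\\D", as.integer(n))
  gsub(pattern, "", x, perl = TRUE)   # lookbehind needs the PCRE engine
}

result <- remove_nondigits_in_prefix(c("A12B345CD678", "AB1", "1234567XYZ"))
print(result)

# A shorter prefix leaves the B at position 4 untouched
shorter <- remove_nondigits_in_prefix("A12B345CD678", n = 3)
print(shorter)
```

The lookbehind is evaluated against the original string, so the positions counted are those of the input, not of the partially cleaned result. Note that . does not match a newline by default, so inputs containing newlines within the prefix would need the (?s) flag.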