I am trying to remove or extract zip codes from character strings. The logic is that I want to grab anything that:

  1. consists of exactly 5 consecutive digits, OR
  2. consists of exactly 5 consecutive digits followed by a dash and then exactly 4 consecutive digits, OR
  3. consists of exactly 5 consecutive digits followed by a space and then exactly 4 consecutive digits

The zip portion of the string may or may not be preceded by a space.

Here's an MWE and what I've tried. The 2 attempted regexes are based on this question and this question:

text.var <- c("Mr. Bean bought 2 tickets 2-613-213-4567",
  "43 Butter Rd, Brossard QC K0A 3P0 – 613 213 4567", 
  "Rat Race, XX, 12345",
  "Ignore phone numbers(613)2134567",
  "Grab zips with dashes 12345-6789 or no space before12345-6789",  
  "Grab zips with spaces 12345 6789 or no space before12345 6789",
  "I like 1234567 dogs"
)

pattern1 <- "\\d{5}([- ]*\\d{4})?"
pattern2 <- "[0-9]{5}(-[0-9]{4})?(?!.*[0-9]{5}(-[0-9]{4})?)"


regmatches(text.var, gregexpr(pattern1, text.var, perl = TRUE)) 
regmatches(text.var, gregexpr(pattern2, text.var, perl = TRUE)) 

## [[1]]
## character(0)
## 
## [[2]]
## character(0)
## 
## [[3]]
## [1] "12345"
## 
## [[4]]
## [1] "21345"
## 
## [[5]]
## [1] "12345-6789"
## 
## [[6]]
## [1] "12345"
## 
## [[7]]
## [1] "12345"

Desired Output

## [[1]]
## character(0)
## 
## [[2]]
## character(0)
## 
## [[3]]
## [1] "12345"
## 
## [[4]]
## character(0)
## 
## [[5]]
## [1] "12345-6789" "12345-6789"
## 
## [[6]]
## [1] "12345 6789" "12345 6789"
## 
## [[7]]
## character(0)

Note: R's regular expressions are similar to other regex flavors but have R-specific quirks. This question is specific to R's regex; it is not a general regex question.

Tyler Rinker
  • I am not sure about the note. When you use `perl=TRUE`, for example, you can also use Perl regex, so generally a classical regex is an R solution. – agstudy Aug 09 '14 at 22:46
  • @agstudy More along the lines of doubling up backslashes and any other R-specific regex things (I don't know regex well enough to know what these things are, but I've found non-R users' regexes often don't translate to R). – Tyler Rinker Aug 09 '14 at 23:02

5 Answers

You can use a regex like this:

"(?<!\\d)(\\d{5}(?:[-\\s]\\d{4})?)\\b"

Working demo
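
A quick way to check this pattern in R is to run it over the question's `text.var` with `gregexpr(..., perl = TRUE)`, since lookbehind needs the PCRE engine. This is only a sketch: the vector is copied from the question (with the en dash written as `\u2013`), and the name `zip_pattern` is an assumption made for illustration.

```r
# Test strings copied from the question; "\u2013" is the en dash in item 2.
text.var <- c(
  "Mr. Bean bought 2 tickets 2-613-213-4567",
  "43 Butter Rd, Brossard QC K0A 3P0 \u2013 613 213 4567",
  "Rat Race, XX, 12345",
  "Ignore phone numbers(613)2134567",
  "Grab zips with dashes 12345-6789 or no space before12345-6789",
  "Grab zips with spaces 12345 6789 or no space before12345 6789",
  "I like 1234567 dogs"
)

# zip_pattern is an illustrative name; backslashes are doubled for R strings.
zip_pattern <- "(?<!\\d)(\\d{5}(?:[-\\s]\\d{4})?)\\b"

# perl = TRUE is required: R's default regex engine has no lookbehind.
zips <- regmatches(text.var, gregexpr(zip_pattern, text.var, perl = TRUE))

lengths(zips)  # number of zip codes found in each string
unlist(zips)   # all matches, flattened into one character vector
```

On these seven strings the counts should come out as 0, 0, 1, 0, 2, 2, 0, which is the shape of the desired output in the question.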

Federico Piazza

This worked for me, and gave the desired output on all of your examples:

"(?<!\\d)(\\d{5}(?:[- ]\\d{4})?)(?!\\d)"
briantist

Lookaround assertion

You can use a combination of a negative lookbehind and a word boundary (\b) here.

regmatches(text.var, gregexpr('(?<!\\d)\\d{5}(?:[ -]\\d{4})?\\b', text.var, perl = TRUE))

Explanation:

  • The negative lookbehind asserts that what precedes is not a digit.
  • Word boundary asserts that on one side there is a word character, and on the other side there is not.

    (?<!        # look behind to see if there is not:
      \d        #   digits (0-9)
    )           # end of look-behind
    \d{5}       # digits (0-9) (5 times)
    (?:         # group, but do not capture (optional):
      [ -]      #   any character of: ' ', '-'
      \d{4}     #   digits (0-9) (4 times)
    )?          # end of grouping
    \b          # the boundary between a word character (\w) and not a word character
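
The commented breakdown above can also be kept inside the pattern itself. With `perl = TRUE`, a leading `(?x)` turns on PCRE's extended mode, where unescaped whitespace outside character classes and `#` comments are ignored. The following is a sketch under that assumption; the names `zip_pattern` and `demo` are made up for illustration.

```r
# Build the pattern line by line; "\n" ends each in-pattern comment.
zip_pattern <- paste(
  "(?x)                 # extended mode: whitespace and comments are ignored",
  "(?<! \\d )           # not preceded by a digit",
  "\\d{5}               # exactly five digits",
  "(?: [ -] \\d{4} )?   # optionally a space or dash, then four digits",
  "\\b                  # word boundary",
  sep = "\n"
)

# Hypothetical test string: one ZIP+4, one 7-digit number, one unspaced zip.
demo <- "zip 12345-6789, not 1234567, and before98765"

found <- regmatches(demo, gregexpr(zip_pattern, demo, perl = TRUE))[[1]]
found
```

The space inside `[ -]` survives because extended mode leaves character classes alone.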
    

Additional options

You may also consider the stringi package, which is generally faster.

> library(stringi)
> stri_extract_all_regex(text.var, '(?<!\\d)\\d{5}(?:[ -]\\d{4})?\\b')
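
One difference worth knowing: by default `stri_extract_all_regex` returns `NA` for strings with no match, whereas `regmatches` returns `character(0)`. The sketch below assumes stringi is installed (it is skipped otherwise) and uses the `omit_no_match` argument to get the `character(0)` shape from the question's desired output. The three test strings are a subset of the question's, and the variable names are illustrative.

```r
# A few of the question's strings: one zip, no zip, two zips.
text.var <- c(
  "Rat Race, XX, 12345",
  "I like 1234567 dogs",
  "Grab zips with dashes 12345-6789 or no space before12345-6789"
)
zip_pattern <- "(?<!\\d)\\d{5}(?:[ -]\\d{4})?\\b"

# Only run if stringi is available; it is not part of base R.
if (requireNamespace("stringi", quietly = TRUE)) {
  # Default: elements without a match become NA.
  with_na <- stringi::stri_extract_all_regex(text.var, zip_pattern)
  # omit_no_match = TRUE: elements without a match become character(0).
  without_na <- stringi::stri_extract_all_regex(text.var, zip_pattern,
                                                omit_no_match = TRUE)
  print(with_na)
  print(without_na)
}
```

Note that stringi uses the ICU regex engine, not PCRE, so patterns relying on PCRE-only features may need adjusting; this one uses constructs both engines support.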
hwnd
  • Works well. Thank you. The comments are also very helpful! – Tyler Rinker Aug 09 '14 at 23:04
  • Just saw you have a regex explaining tool: http://liveforfaith.com/re/explain.pl Very cool :-) Also, I'm turning a number of these regexes into a quickie R package. I'd like to give you contributor authorship on the package with a name beyond SO's hwnd. If you'd like your actual name used, please send an email: https://github.com/trinker – Tyler Rinker Aug 10 '14 at 01:48

RegEx with LookArounds:

(?<![0-9-])([0-9]{5}(?:[ -][0-9]{4})?)(?![0-9-])

Live demo: http://regex101.com/r/hU9oK4/1

The stuff we're after:

  • [0-9]{5} is the most important part, looking for exactly 5 digits

  • (?:[ -][0-9]{4})? optionally followed by 4 more digits, BUT only if joined by a space or minus sign

Boundaries, boundaries, boundaries:

  • (?<![0-9-]) first group: negative lookbehind (makes sure no digit or dash comes before the match)

  • (?![0-9-]) last group: negative lookahead (same idea: no digit or dash comes after the match)

Extra test cases:

another zip 09788-4234has no space after
98712
987122
zip or range 12987-19222 ?
what about this serial 88101-8892-22912-9991-99101 ?
90872-8881

Why?

  • LookArounds don't consume characters
  • you shouldn't be picking up false positives (e.g. the first or last 5 digits of a longer number)
  • the ZIP might be on its own line, or at the very beginning or end
  • you could bump into a space-less address
  • 5 digits starting with a minus sign should not be a zip code
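
To see what the stricter boundaries buy you, the extra test cases can be run in R against both this pattern and a word-boundary variant. This is a sketch: the helper `find_zips` and the variable names are hypothetical, and the `\b` pattern is written with `[0-9]` only to keep the two versions comparable.

```r
# The extra test cases listed above, one string per line.
extra <- c(
  "another zip 09788-4234has no space after",
  "98712",
  "987122",
  "zip or range 12987-19222 ?",
  "what about this serial 88101-8892-22912-9991-99101 ?",
  "90872-8881"
)

# Lookarounds on both sides: no digit or dash may touch the match.
lookaround <- "(?<![0-9-])([0-9]{5}(?:[ -][0-9]{4})?)(?![0-9-])"
# Word-boundary variant: only a leading digit is excluded.
word_bound <- "(?<![0-9])[0-9]{5}(?:[ -][0-9]{4})?\\b"

# Hypothetical helper: all matches of `pattern` in each element of `x`.
find_zips <- function(pattern, x) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))
}

strict <- find_zips(lookaround, extra)
loose  <- find_zips(word_bound, extra)

strict
loose
```

With these inputs the lookaround version should keep only the first, second and last entries, while the `\b` variant truncates the first zip to `09788` and also picks up pieces of the range and the serial number.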

Final notes: this is not intended to be a final or bulletproof matcher. You might still collect some zip lookalikes, especially because your requirement allows a space between the digit groups.

Personal note: I find [0-9] character classes clearer and easier for newcomers to RegEx to understand than \d, even though \d covers the same digits; they're also faster and more compatible across RegEx flavours. On the other hand, double escapes (e.g. \\d) are an ugly read.

CSᵠ

The qdapRegex package has the rm_zip function (based on @hwnd's response) for this:

rm_zip(text.var)
rm_zip(text.var, extract=TRUE)

> rm_zip(text.var, extract=TRUE)
[[1]]
[1] NA

[[2]]
[1] NA

[[3]]
[1] "12345"

[[4]]
[1] NA

[[5]]
[1] "12345-6789" "12345-6789"

[[6]]
[1] "12345 6789" "12345 6789"

[[7]]
[1] NA
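
If you would rather not add a dependency, the same remove-or-extract idea can be sketched in base R. The helper name `strip_zip` is hypothetical, and this is not how qdapRegex implements `rm_zip`; it simply reuses the lookbehind plus word-boundary pattern from @hwnd's answer.

```r
# Pattern from @hwnd's answer: 5 digits, optional space/dash plus 4 digits.
zip_pattern <- "(?<!\\d)\\d{5}(?:[ -]\\d{4})?\\b"

# Hypothetical helper: remove zip codes, or return them when extract = TRUE.
strip_zip <- function(x, extract = FALSE) {
  if (extract) {
    return(regmatches(x, gregexpr(zip_pattern, x, perl = TRUE)))
  }
  out <- gsub(zip_pattern, "", x, perl = TRUE)  # drop every zip code
  trimws(gsub("\\s+", " ", out))                # tidy leftover whitespace
}

strip_zip("Rat Race, XX, 12345")
strip_zip("I like 1234567 dogs")
strip_zip("Rat Race, XX, 12345", extract = TRUE)
```

Unlike the `rm_zip` output shown above, the extract branch here returns `character(0)` rather than `NA` for strings without a zip code.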
Tyler Rinker