Remove US zip codes from a string: R regex

Question

I am attempting to remove/extract zip codes from a character string. The logic is that I am grabbing things that:

must contain exactly 5 consecutive digits OR
must contain exactly 5 consecutive digits followed by a dash and then exactly 4 consecutive digits OR
must contain exactly 5 consecutive digits followed by a space and then exactly 4 consecutive digits

The zip portion of the string may or may not be preceded by a space.

Here's a MWE and what I've tried. The 2 attempted regexes are based on this question and this question:

text.var <- c("Mr. Bean bought 2 tickets 2-613-213-4567",
  "43 Butter Rd, Brossard QC K0A 3P0 – 613 213 4567", 
  "Rat Race, XX, 12345",
  "Ignore phone numbers(613)2134567",
  "Grab zips with dashes 12345-6789 or no space before12345-6789",  
  "Grab zips with spaces 12345 6789 or no space before12345 6789",
  "I like 1234567 dogs"
)

pattern1 <- "\\d{5}([- ]*\\d{4})?"
pattern2 <- "[0-9]{5}(-[0-9]{4})?(?!.*[0-9]{5}(-[0-9]{4})?)"


regmatches(text.var, gregexpr(pattern1, text.var, perl = TRUE)) 
regmatches(text.var, gregexpr(pattern2, text.var, perl = TRUE)) 

## [[1]]
## character(0)
## 
## [[2]]
## character(0)
## 
## [[3]]
## [1] "12345"
## 
## [[4]]
## [1] "21345"
## 
## [[5]]
## [1] "12345-6789"
## 
## [[6]]
## [1] "12345"
## 
## [[7]]
## [1] "12345"

Desired Output

## [[1]]
## character(0)
## 
## [[2]]
## character(0)
## 
## [[3]]
## [1] "12345"
## 
## [[4]]
## character(0)
## 
## [[5]]
## [1] "12345-6789" "12345-6789"
## 
## [[6]]
## [1] "12345 6789" "12345 6789"
## 
## [[7]]
## character(0)

Note: R's regular expressions are similar to other regex flavours but have R-specific quirks. This question is specific to R's regex; it is not a general regex question.

I am not sure about the note. When you use `perl=TRUE`, for example, you can also use Perl regex, so generally a classical regex is an R solution. — agstudy, Aug 09 '14 at 22:46
@agstudy More along the lines of doubling up back slashes and any other R specific regex things (I don't know regex well enough to know what these things are but I've found non-R users' regexes often don't translate to R). — Tyler Rinker, Aug 09 '14 at 23:02

score 2 · Answer 1 · edited Aug 09 '14 at 23:04


You can use a regex like this:

"(?<!\\d)(\\d{5}(?:[-\\s]\\d{4})?)\\b"


answered Aug 09 '14 at 22:56 by Federico Piazza · edited Aug 09 '14 at 23:04 by Tyler Rinker

score 2 · Answer 2 · edited Aug 09 '14 at 23:06


This worked for me, and gave the desired output on all of your examples:

"(?<!\\d)(\\d{5}(?:[- ]\\d{4})?)(?!\\d)"

answered Aug 09 '14 at 22:56 by briantist · edited Aug 09 '14 at 23:06 by Tyler Rinker
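As a quick check of the lookahead variant above, the pattern can be run against the question's `text.var` with base R's `gregexpr()` and `regmatches()`. This is only a minimal sketch: the name `zip_pattern` is introduced here for illustration, and the en dash in the second string is written as `\u2013` so the snippet does not depend on the file encoding.

```r
# Sample data from the question (en dash written as an escape).
text.var <- c("Mr. Bean bought 2 tickets 2-613-213-4567",
  "43 Butter Rd, Brossard QC K0A 3P0 \u2013 613 213 4567",
  "Rat Race, XX, 12345",
  "Ignore phone numbers(613)2134567",
  "Grab zips with dashes 12345-6789 or no space before12345-6789",
  "Grab zips with spaces 12345 6789 or no space before12345 6789",
  "I like 1234567 dogs"
)

# zip_pattern is an illustrative name; the regex is the one from the answer.
zip_pattern <- "(?<!\\d)(\\d{5}(?:[- ]\\d{4})?)(?!\\d)"

# perl = TRUE is needed because lookarounds are a PCRE feature.
zips <- regmatches(text.var, gregexpr(zip_pattern, text.var, perl = TRUE))
zips
```

Elements without a zip come back as `character(0)`, which matches the desired output in the question. One difference from the `\\b` variant: a trailing letter (as in "12345abc") still passes `(?!\\d)`, whereas `\\b` would reject it.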

hwnd · Accepted Answer · 2014-09-11T02:38:19.627

Lookaround assertion

You can use a combination of Negative Lookbehind and a word boundary \b here.

regmatches(text.var, gregexpr('(?<!\\d)\\d{5}(?:[ -]\\d{4})?\\b', text.var, perl=TRUE))

Explanation:

The negative lookbehind asserts that what precedes is not a digit.

Word boundary asserts that on one side there is a word character, and on the other side there is not.

(?<!        # look behind to see if there is not:
  \d        #   digits (0-9)
)           # end of look-behind
\d{5}       # digits (0-9) (5 times)
(?:         # group, but do not capture (optional):
  [ -]      #   any character of: ' ', '-'
  \d{4}     #   digits (0-9) (4 times)
)?          # end of grouping
\b          # the boundary between a word character (\w) and not a word character
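To see what each assertion contributes, it may help to drop them one at a time and run the result on the 7-digit example from the question. The sketch below does that; `extract_all` is a hypothetical helper written for this illustration, not part of base R or any package.

```r
x <- "I like 1234567 dogs"   # a 7-digit run that should not yield a zip

# Illustrative helper: return all matches of `pattern` in a single string.
extract_all <- function(pattern, x) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

# Full pattern: no 5-digit window inside the run satisfies both assertions.
full <- extract_all("(?<!\\d)\\d{5}(?:[ -]\\d{4})?\\b", x)

# Without the lookbehind, a match may start mid-run and grab the tail.
no_lookbehind <- extract_all("\\d{5}(?:[ -]\\d{4})?\\b", x)

# Without the word boundary, a match may stop mid-run and grab the head.
no_boundary <- extract_all("(?<!\\d)\\d{5}(?:[ -]\\d{4})?", x)

full            # character(0)
no_lookbehind   # "34567"
no_boundary     # "12345"
```

Both assertions are needed: the lookbehind guards the left edge of the digit run and the word boundary guards the right edge.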

Additional options

You may also consider the stringi package, which tends to perform faster.

> library(stringi)
> stri_extract_all_regex(text.var, '(?<!\\d)\\d{5}(?:[ -]\\d{4})?\\b')

Just saw you have a regex explaining tool: http://liveforfaith.com/re/explain.pl Very cool :-) Also I'm turning a number of these regexes into a quickie R package. I'd like to give you contributor authorship on the package with a name beyond SO's hwnd. If you'd like your actual name used, please send an email https://github.com/trinker — Tyler Rinker, Aug 10 '14 at 01:48
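For the stringi call in the answer above, one detail worth knowing is how non-matches are reported. The sketch below assumes the stringi package is installed and uses a shortened input; the variable names are illustrative.

```r
# Assumes stringi is installed: install.packages("stringi")
library(stringi)

x <- c("Rat Race, XX, 12345",
       "Ignore phone numbers(613)2134567",
       "Grab zips with dashes 12345-6789 or no space before12345-6789")

pattern <- "(?<!\\d)\\d{5}(?:[ -]\\d{4})?\\b"

# Default behaviour: elements without a match come back as NA.
with_na <- stri_extract_all_regex(x, pattern)

# omit_no_match = TRUE yields character(0) instead, like regmatches().
without_na <- stri_extract_all_regex(x, pattern, omit_no_match = TRUE)
```

With `omit_no_match = TRUE` the result has the same shape as the base R `regmatches()` output, which makes the two approaches easier to compare.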

score 1 · Answer 4 · edited Jun 20 '20 at 09:12

RegEx with LookArounds:

(?<![0-9-])([0-9]{5}(?:[ -][0-9]{4})?)(?![0-9-])

Live demo: http://regex101.com/r/hU9oK4/1

The stuff we're after:

[0-9]{5} is the most important part, looking for exactly 5 digits
(?:[ -][0-9]{4})? optionally followed by 4 more digits, but only if joined by a space or a hyphen

Boundaries, boundaries, boundaries:

(?<![0-9-]) first group: negative lookbehind (makes sure no digit or dash precedes the match)
(?![0-9-]) last group: negative lookahead (makes sure no digit or dash follows the match)

Extra test case:

another zip 09788-4234has no space after
98712
987122
zip or range 12987-19222 ?
what about this serial 88101-8892-22912-9991-99101 ?
90872-8881
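The extra test cases above can be run through the pattern in R as follows. This is a sketch using base R only; the names `extra`, `strict` and `loose` are illustrative, and `loose` is a digit-only lookaround variant included purely for contrast.

```r
extra <- c("another zip 09788-4234has no space after",
           "98712",
           "987122",
           "zip or range 12987-19222 ?",
           "what about this serial 88101-8892-22912-9991-99101 ?",
           "90872-8881")

# Pattern from this answer: digits AND dashes are excluded on both sides.
strict <- "(?<![0-9-])([0-9]{5}(?:[ -][0-9]{4})?)(?![0-9-])"
res <- regmatches(extra, gregexpr(strict, extra, perl = TRUE))

# For contrast: excluding only digits lets pieces of a dashed range through.
loose <- "(?<![0-9])[0-9]{5}(?:[ -][0-9]{4})?(?![0-9])"
range_hits <- regmatches(extra[4], gregexpr(loose, extra[4], perl = TRUE))[[1]]

res          # only elements 1, 2 and 6 contain a match
range_hits   # "12987" "19222"
```

Adding the dash to the lookarounds is what keeps the range and the serial number from producing false positives.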

Why?

LookArounds don't consume characters
you shouldn't be picking up false positives (e.g. the first or last 5 digits of a longer number)
ZIP might be on its own line, or at the very beginning or end
you could bump into a space-less address
5 digits starting with a minus sign should not be a zip code

Final notes: this is not intended to be a final or bulletproof pattern. You might still collect some zip lookalikes, especially because your requirement allows a space between the digit groups.

Personal note: I find [0-9] character classes clearer and easier for newcomers to RegEx to understand, even though \d covers the same digits; they're also faster and more compatible between RegEx flavours. Double escapes (e.g. \\d), on the other hand, are an ugly read.
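On the double-escape point, two ways to avoid `\\` in R can be sketched as follows. Raw strings are an assumption about your setup, since they require R 4.0.0 or later; the variable names are illustrative.

```r
# Classic R string literal: every regex backslash has to be doubled.
escaped <- "(?<!\\d)\\d{5}(?:[ -]\\d{4})?\\b"

# Raw string literal (R >= 4.0.0): backslashes are written once.
raw <- r"((?<!\d)\d{5}(?:[ -]\d{4})?\b)"

# Both literals produce exactly the same pattern string.
same_pattern <- identical(escaped, raw)

# Using [0-9] and lookarounds only, no backslash is needed at all.
no_backslash <- "(?<![0-9])[0-9]{5}(?:[ -][0-9]{4})?(?![0-9])"
hit <- regmatches("Rat Race, XX, 12345",
                  regexpr(no_backslash, "Rat Race, XX, 12345", perl = TRUE))
```

Which form reads better is a matter of taste; the raw-string version stays closest to what a regex tester would show.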

@hwnd indeed and also `[0-9]` *bypasses* the need to double escape `\d` — CSᵠ, Aug 10 '14 at 08:27

score 0 · Answer 5 · answered Sep 29 '14 at 04:26

The qdapRegex package has the rm_zip function (based on @hwnd's response) for this:

rm_zip(text.var)
rm_zip(text.var, extract=TRUE)

> rm_zip(text.var, extract=TRUE)
[[1]]
[1] NA

[[2]]
[1] NA

[[3]]
[1] "12345"

[[4]]
[1] NA

[[5]]
[1] "12345-6789" "12345-6789"

[[6]]
[1] "12345 6789" "12345 6789"

[[7]]
[1] NA
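If you would rather not add a dependency, removal can also be sketched in base R with `gsub()`. This is not how `rm_zip` is implemented, only an approximation using the accepted answer's pattern plus an assumed whitespace clean-up step.

```r
pattern <- "(?<!\\d)\\d{5}(?:[ -]\\d{4})?\\b"

x <- c("Rat Race, XX, 12345",
       "Grab zips with dashes 12345-6789 or no space before12345-6789",
       "I like 1234567 dogs")

# Delete every match, then collapse leftover runs of whitespace and trim.
removed <- gsub(pattern, "", x, perl = TRUE)
removed <- trimws(gsub("\\s+", " ", removed))

removed
# "Rat Race, XX,"
# "Grab zips with dashes or no space before"
# "I like 1234567 dogs"
```

Strings without a zip pass through unchanged, so the call is safe to apply to a whole column.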
