When I use the XML package and XPath to scrape addresses from websites, I can sometimes retrieve only a single string with the zip code I want embedded in it. Extracting the zip code is straightforward, but other five-digit strings sometimes show up in the same string.

Here are some variations on the problem in a data frame.

zips <- data.frame(id = seq(1, 5), address = c("Company, 18540 Main Ave., City, ST 12345", "Company 18540 Main Ave. City ST 12345-0000", "Company 18540 Main Ave. City State 12345", "Company, 18540 Main Ave., City, ST 12345 USA", "Company, One Main Ave Suite 18540, City, ST 12345")) 

The R statement below extracts zip codes (both the five-digit and the ZIP+4 forms), but it is tricked by the faux zip codes of the street number and the suite number (and there may be other possibilities in other address strings).

regmatches(zips$address, gregexpr("\\d{5}([-]?\\d{4})?", zips$address, perl = TRUE))

An answer to a previous SO question suggested that a "regex will return the last consecutive five digit string. It uses a negative look-ahead to ensure the absence of 5-digit strings after the one being returned."
Extracting a zip code from an address string

\b\d{5}\b(?!.*\b\d{5}\b)

But that question and its answer deal with PHP and offer an if block with preg_matches(). I am not familiar with that language or its tools, but the idea might be right.

My question: what R code will find real zip codes and ignore false lookalikes?

lawyeR
  • @CarlWitthoft: thank you for the reference to a post emphasizing yet again the distinction between parsing HTML and using regex on HTML. Point taken. My question goes to what to do when parsing has done all it can do, and returns a block string with lots of address components. Is it not true that then regex is the tool to invoke? – lawyeR Aug 07 '14 at 13:42
  • lawyeR, ignore that comment, it has no relevance to your question. Just people adding pointless comments because they think they're being funny. Given your problem [return the last set of 5 consecutive digits from a string] there's no problem with using a regex for that. (That said, if your problem is more complicated than that, it may not be appropriate.) – Joe Aug 07 '14 at 14:17

2 Answers

This is my first regex answer (I am still learning) so hopefully I don't say anything wrong to lead you in the wrong direction.

Basically, as you hinted in your question, this regex looks for the last string that looks like a zip code, that is, one that is not followed by another string that looks like a zip code.

The basic syntax is pattern(?!.*pattern), which says to match pattern only if it is not followed by anything (.*) and then pattern again. The (?! ) construct is a negative look-ahead assertion.

So we can replace pattern with what you are interested in finding:

[0-9]{5}(-[0-9]{4})?

That is, a digit string [0-9] of exactly five characters {5}, which may optionally (?) be followed by another group, defined as a hyphen and another digit string of length four: (-[0-9]{4}).
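To see what the look-ahead contributes, here is a minimal sketch on a single made-up string (the string and the variable names are only for illustration). It compares the bare five-digit pattern with the same pattern followed by the negative look-ahead:

```r
# Toy input: a suite number that looks like a zip code, then the real one.
x <- "Suite 18540, City, ST 12345"

# Without the look-ahead, every five-digit run matches.
all_hits <- regmatches(x, gregexpr("[0-9]{5}", x, perl = TRUE))[[1]]

# With the look-ahead, a run matches only if no other five-digit run
# appears anywhere after it, so only the last one survives.
last_hit <- regmatches(x, gregexpr("[0-9]{5}(?!.*[0-9]{5})", x, perl = TRUE))[[1]]

print(all_hits)
print(last_hit)
```

Note that perl = TRUE is required, since look-ahead assertions are not part of R's default regex syntax.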

Putting it all together, with gregexpr to search for the matches and regmatches to extract them, I get:

zips <- data.frame(id = seq(1, 5), address = c("Company, 18540 Main Ave., City, ST 12345", "Company 18540 Main Ave. City ST 12345-0000", "Company 18540 Main Ave. City State 12345", "Company, 18540 Main Ave., City, ST 12345 USA", "Company, One Main Ave Suite 18540, City, ST 12345")) 
regmatches(zips$address,
           gregexpr('[0-9]{5}(-[0-9]{4})?(?!.*[0-9]{5}(-[0-9]{4})?)', zips$address, perl = TRUE))

# [[1]]
# [1] "12345"
# 
# [[2]]
# [1] "12345-0000"
# 
# [[3]]
# [1] "12345"
# 
# [[4]]
# [1] "12345"
# 
# [[5]]
# [1] "12345"
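Since regmatches returns a list, it may be handier to collapse the result into a plain character vector and store it next to the addresses. The following is a sketch only: the column name zip and the helper names are my own choices, and strings with no match are assumed to be best represented as NA.

```r
zips <- data.frame(
  id = seq(1, 5),
  address = c("Company, 18540 Main Ave., City, ST 12345",
              "Company 18540 Main Ave. City ST 12345-0000",
              "Company 18540 Main Ave. City State 12345",
              "Company, 18540 Main Ave., City, ST 12345 USA",
              "Company, One Main Ave Suite 18540, City, ST 12345"),
  stringsAsFactors = FALSE
)

# Same pattern as above: a zip-like string not followed by another one.
zip_pattern <- "[0-9]{5}(-[0-9]{4})?(?!.*[0-9]{5}(-[0-9]{4})?)"
hits <- regmatches(zips$address, gregexpr(zip_pattern, zips$address, perl = TRUE))

# Keep the last match per address, or NA when nothing matched.
zips$zip <- vapply(
  hits,
  function(h) if (length(h) > 0) h[length(h)] else NA_character_,
  character(1)
)

print(zips$zip)
```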
rawr
The qdapRegex package has the rm_zip function for this:

zips <- data.frame(id = seq(1, 5), 
    address = c("Company, 18540 Main Ave., City, ST 12345", 
    "Company 18540 Main Ave. City ST 12345-0000", 
    "Company 18540 Main Ave. City State 12345", 
    "Company, 18540 Main Ave., City, ST 12345 USA", 
    "Company, One Main Ave Suite 18540, City, ST 12345")
)

library(qdapRegex)

lapply(rm_zip(zips$address, extract=TRUE), tail, 1)

## [[1]]
## [1] "12345"
## 
## [[2]]
## [1] "12345-0000"
## 
## [[3]]
## [1] "12345"
## 
## [[4]]
## [1] "12345"
## 
## [[5]]
## [1] "12345"

EDIT Per @lawyeR's comments:

I think you want a regex that is more specific than the dictionary system used by qdapRegex. The current implementation of rm_zip is meant to serve validation purposes as well, so I wouldn't alter its regular expression to be more flexible. I also wouldn't add parameters/arguments to rm_zip, as qdapRegex attempts to have consistently operating functions.

That being said, you could create your own function using the rm_ function and supply your own regular expression. I have done this for both of the requests in your comments:

More complex data set:

zips <- data.frame(id = seq(1, 6), 
    address = c("Company, 18540 Main Ave., City, ST 12345", 
    "Company 18540 Main Ave. City ST 12345-0000", 
    "Company 18540 Main Ave. City State 12345", 
    "Company, 18540 Main Ave., City, ST 12345 USA", 
    "Company, One Main Ave Suite 18540m, City, ST 12345",
    "company 12345678")
)

Function to extract the zip code even when a letter immediately follows it

## paste together a more flexible regular expression    
pat <- pastex(
    "@rm_zip", 
    "(?<!\\d)\\d{5}(?!\\d)",
    "(?<!\\d)\\d{5}-\\d{4}(?!\\d)"
)
## Create your own function with extract set to TRUE
rm_zip2 <- rm_(pattern=pat, extract=TRUE)
rm_zip2(zips$address)

## [[1]]
## [1] "18540" "12345"
## 
## [[2]]
## [1] "18540"      "12345-0000"
## 
## [[3]]
## [1] "18540" "12345"
## 
## [[4]]
## [1] "18540" "12345"
## 
## [[5]]
## [1] "18540" "12345"
## 
## [[6]]
## [1] NA

Function to extract just 5 digit zips

rm_zip3 <- rm_(pattern="(?<!\\d)\\d{5}(?!\\d)", extract=TRUE)
rm_zip3(zips$address)

## [[1]]
## [1] "18540" "12345"
## 
## [[2]]
## [1] "18540" "12345"
## 
## [[3]]
## [1] "18540" "12345"
## 
## [[4]]
## [1] "18540" "12345"
## 
## [[5]]
## [1] "18540" "12345"
## 
## [[6]]
## [1] NA
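For readers without qdapRegex, the same look-around pattern can be tried in base R. This is a sketch under assumptions: the third address is a made-up example modeled on the 02138M case from the comments, and taking the last match per string is my choice for picking the real zip code over earlier look-alikes.

```r
addresses <- c(
  "Company, One Main Ave Suite 18540m, City, ST 12345",
  "company 12345678",
  "City, MA 02138M"   # hypothetical: a letter directly after the zip
)

# Exactly five digits, with no digit immediately before or after.
five_digit <- "(?<!\\d)\\d{5}(?!\\d)"
hits <- regmatches(addresses, gregexpr(five_digit, addresses, perl = TRUE))

# Keep the last five-digit run per string, or NA when there is none.
last_zip <- vapply(
  hits,
  function(h) if (length(h) > 0) h[length(h)] else NA_character_,
  character(1)
)

print(last_zip)
```

The eight-digit run in the second string yields no match because every five-digit window inside it has a digit on at least one side.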
Tyler Rinker
  • thank you. I compared rm_zip() to the call I had been using: str_extract(string = locations$location, pattern = "\\d{5}") on a large set of data I have. Your qdapRegex function did not extract 7 zip codes that each had a letter immediately after them, such as 02138M. Is there a fix? – lawyeR Sep 29 '14 at 11:52
  • not sure my previous comment had the correct @. Please let me know. Also, is there an argument to turn off the final four digits and hyphen preceding it? E.g., 02138 instead of 02138-1234 – lawyeR Sep 29 '14 at 11:56
  • @lawyeR That is outside of the consistency and validation realm of `qdapRegex` but you could use `rm_` to build your own function to do this as I demo-ed in the edit above. – Tyler Rinker Sep 29 '14 at 13:04
  • much obliged, and very impressed. Thank you. – lawyeR Sep 29 '14 at 13:23