1

I have hundreds of addresses in a data frame variable and need to extract zip codes from them. Some of the addresses contain more than one city, each with a zip code. Here is a mock example of a data frame and R code that extracts the zip codes.

require(qdapRegex)
require(stringr)

df <- data.frame(address = c("Walnut; 94596, Ontario, 91761, Beach, CA 90071", "Irvine Cal 92164"), var2 = "text")
df$zip.Rinker <- sapply(df$address, FUN = rm_zip, extract=TRUE) 

The rm_zip function from Tyler Rinker's qdapRegex package extracts all the zip codes and puts them in list if there is more than one.

> df
                                         address var2          zip.Rinker
1 Walnut; 94596, Ontario, 91761, Beach, CA 90071 text 94596, 91761, 90071
2                               Irvine Cal 92164 text               92164

How can R create a new row for each zip code that is in row 1 under zip.Rinker? Something like the following would be ideal. Note, there will be dozens of addresses that have multiple zip codes, so I am hoping for a solution that doesn't require manual steps.

                                         address var2          zip.Rinker
1 Walnut; 94596, Ontario, 91761, Beach, CA 90071 text               94596
2 Walnut; 94596, Ontario, 91761, Beach, CA 90071 text               91761
3 Walnut; 94596, Ontario, 91761, Beach, CA 90071 text               90071
4                               Irvine Cal 92164 text               92164

Thank you for time.

PS Using stringr, this code extracts zip codes and presents the same challenge.

df$zip.stringr <- str_extract_all(string = df$address, pattern = "\\d{5}") 
Tyler Rinker
  • 108,132
  • 65
  • 322
  • 519
lawyeR
  • 7,488
  • 5
  • 33
  • 63
  • possible duplicate of [Split string and insert as new rows](http://stackoverflow.com/questions/15347282/split-string-and-insert-as-new-rows) – hrbrmstr Oct 12 '14 at 12:40

4 Answers4

2

You could do:

data.frame(rep(df$address, sapply(df$zip.Rinker, length)), unlist(df$zip.Rinker)

##   rep.df.address..sapply.df.zip.Rinker..length.. unlist.df.zip.Rinker.
## 1 Walnut; 94596, Ontario, 91761, Beach, CA 90071                 94596
## 2 Walnut; 94596, Ontario, 91761, Beach, CA 90071                 91761
## 3 Walnut; 94596, Ontario, 91761, Beach, CA 90071                 90071
## 4                               Irvine Cal 92164                 92164

But note that rm_zip is already vectorized and pretty speedy as it wraps the stringi package. So no need for sapply. Here's an approach that makes the code much more condensed using qdapTools's list2df that takes a named list of vectors and turns them into a data.frame.

library(qdapTools)
list2df(setNames(rm_zip(df$address, extract=TRUE), df$address), "zip", "address")[, 2:1]

##                                          address   zip
## 1 Walnut; 94596, Ontario, 91761, Beach, CA 90071 94596
## 2 Walnut; 94596, Ontario, 91761, Beach, CA 90071 91761
## 3 Walnut; 94596, Ontario, 91761, Beach, CA 90071 90071
## 4                               Irvine Cal 92164 92164

And I like the magrittr framework for nested functions so here's that:

library(qdapTools)
library(magrittr)

df$address %>%
    rm_zip(extract=TRUE) %>%
    setNames(df$address) %>%
    list2df("zip", "address") %>%
    `[`(, 2:1)
Tyler Rinker
  • 108,132
  • 65
  • 322
  • 519
1

Here's an approach using "data.table" and gregexpr/regmatches:

library(data.table)
as.data.table(df)[, c(.SD, Zips = unlist(list(
  Zips = regmatches(address, gregexpr("\\d{5}", address))))), 
  by = 1:nrow(df)]
#    nrow                                        address var2  Zips
# 1:    1 Walnut; 94596, Ontario, 91761, Beach, CA 90071 text 94596
# 2:    1 Walnut; 94596, Ontario, 91761, Beach, CA 90071 text 91761
# 3:    1 Walnut; 94596, Ontario, 91761, Beach, CA 90071 text 90071
# 4:    2                               Irvine Cal 92164 text 92164
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
  • I must be doing something wrong. I get this > as.data.table(df)[, c(.SD, Zips = unlist(list( + Zips = regmatches(address, gregexpr("\\d{5}", address))))), + by = 1:nrow(df)] Error in `[.data.table`(as.data.table(df), , c(.SD, Zips = unlist(list(Zips = regmatches(address, : j doesn't evaluate to the same number of columns for each group – lawyeR Oct 12 '14 at 13:00
  • @lawyeR, I just double-checked, posting the same code above, and it worked without a problem for me. What version of "data.table" are you using? – A5C1D2H2I1M1N2O1R2T1 Oct 12 '14 at 13:14
  • I am using Version 1.9.2 of data.table. Sorry to bother you. I will re-install and try again – lawyeR Oct 12 '14 at 13:29
  • @lawyeR, OK. I think 1.9.4 is on CRAN now, so you can try that and see if it makes a difference. – A5C1D2H2I1M1N2O1R2T1 Oct 12 '14 at 13:32
0

Here's an approach using just the stringi package:

library(stringi)
zip <- stri_extract_all_regex(df$address, "\\d{5}") 
data.frame(address=rep(df$address, sapply(zip, length)), zip=unlist(zip))

##                                          address   zip
## 1 Walnut; 94596, Ontario, 91761, Beach, CA 90071 94596
## 2 Walnut; 94596, Ontario, 91761, Beach, CA 90071 91761
## 3 Walnut; 94596, Ontario, 91761, Beach, CA 90071 90071
## 4                               Irvine Cal 92164 92164
Tyler Rinker
  • 108,132
  • 65
  • 322
  • 519
0

One more approach, this one using only base R with hwnd's regular expression for extracting zip codes Remove US zip codes from a string: Regex

match <- gregexpr('(?<!\\d)\\d{5}(?:[ -]\\d{4})?\\b', df$address, perl=T)
zips <- regmatches(df$address,match)
nn <- rep(1:length(match),sapply(zips,length))
data.frame(df[nn,], zip=unlist(zips))

                                          address var2   zip
1   Walnut; 94596, Ontario, 91761, Beach, CA 90071 text 94596
1.1 Walnut; 94596, Ontario, 91761, Beach, CA 90071 text 91761
1.2 Walnut; 94596, Ontario, 91761, Beach, CA 90071 text 90071
2                                 Irvine Cal 92164 text 92164
Community
  • 1
  • 1
WaltS
  • 5,410
  • 2
  • 18
  • 24