1

I am trying to determine in R how to split a column that has multiple fields with multiple delimiters.

From an API, I get a column in a data frame called "Location". It has multiple location identifiers in it. Here is an example of one entry. (edit- I added a couple more)

6540 BENNINGTON AVE
Kansas City, MO 64133
(39.005620414000475, -94.50998643299965)

4284 E 61ST ST
Kansas City, MO 64130
(39.014638172000446, -94.5335298549997)


3002 SPRUCE AVE
Kansas City, MO 64128
(39.07083265200049, -94.53320606399967)


6022 E Red Bridge Rd
Kansas City, MO 64134
(38.92458893200046, -94.52090062499968)

So the above is the entry in row 1-4, column "location".

I want split this into address, city, state, zip, long and lat columns. Some fields are separated by space or tab while others by comma. Also nothing is fixed width.

I have looked at the reshape package- but seems I need a single deliminator. I can't use space (or can I?) as the address has spaces in it.

Thoughts?

mpg
  • 3,679
  • 8
  • 36
  • 45
  • 1
    `strsplit` allows you to specify a regular expression for splitting. Have you even tried to use that yet? – MrFlick Aug 06 '14 at 19:39
  • 2
    Looks to me that you want a multi-line read. Post a more complete data example, with say 12-15 lines of data. – IRTFM Aug 06 '14 at 19:48
  • 2
    If you already have this column called "location" in a data frame in R called "Location." Post the output of `head(Location$location, 3)` or `dput(Location$location[1:3]`. You may want to have a good look at [how to make a great R reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Also, on an unrelated note, did you know you can vote? I only ask because I notice you've never voted. – Jota Aug 06 '14 at 21:27
  • 1
    It would also be helpful to know if there were any other patterns to the format. For example, are these all US addresses? Address format varies internationally. Is each of the three lines you show always separated by a newline or tab and are they always in the same order? Are States always abbreviated or are some written out? Is the zip code always just the 5-digit version or is it sometimes the 5+4 digit version? Or, alternately, do you only care to extract the 4 digits? – Cotton.Rockwood Aug 06 '14 at 22:06
  • I added several more rows (this is all public data from KCMO open source). As you can see the format is all basically the same. I like the idea of adding a comma, then splitting. Will play with that. – mpg Aug 07 '14 at 16:55

2 Answers2

2

If the data you have is not like this, let everyone know by adding code we can copy and paste into R to reproduce your data (see how this sample data can be easily copied and pasted into R?)

Sample data:

location <- c(
"6540 BENNINGTON AVE
Kansas City, MO 64133
(39.005620414000475, -94.50998643299965)",

"456 POOH LANE
New York City, NY 10025
(40, -90)")

location
#[1] "6540 BENNINGTON AVE\nKansas City, MO 64133\n(39.005620414000475, -94.50998643299965)"
#[2] "456 POOH LANE\nNew York City, NY 10025\n(40, -90)"

A solution:

# Insert a comma between the state abbreviation and the zip code
step1 <- gsub("([[:alpha:]]{2}) ([[:digit:]]{5})", "\\1,\\2", location)
# get rid of parentheses
step2 <- gsub("\\(|\\)", "", step1)
# split on "\n", ",", and ", "
strsplit(step2, "\n|,|, ")

#[[1]]
#[1] "6540 BENNINGTON AVE" "Kansas City"         "MO"                
#[4] "64133"               "39.005620414000475"  "-94.50998643299965"

#[[2]]
#[1] "456 POOH LANE"  "New York City" "NY"           "10025"        
#[5] "40"            "-90" 
Jota
  • 17,281
  • 7
  • 63
  • 93
  • A good solution without more information. Just in case, note that step1 could fail if you have rural routes which have "RR 1234 STREETNAME ST" as their first line. – Cotton.Rockwood Aug 06 '14 at 22:18
  • @Cotton.Rockwood Thanks for pointing that out! I'll wait to see if OP adds anything and responds to your comment on the question. – Jota Aug 06 '14 at 23:03
  • I added several more rows (this is all public data from KCMO open source). As you can see the format is all basically the same. I like the idea of adding a comma, then splitting. Will play with that. I have not used that function before. Thanks. – mpg Aug 07 '14 at 18:11
2

Here is an example with the stringr package. Using @Frank's example data from above, you can do:

library(stringr)
address <- str_match(location,
                     "(^[[:print:]]+)[[:space:]]([[:alpha:]. ]+), ([[:alpha:]]{2}) ([[:digit:]]{5})[[:space:]][(]([[:digit:].-]+), ([[:digit:].-]+)")
address <- data.frame(address[,-1]) # get rid of the first column which has the full match
names(address) <- c("address", "city", "state", "zip", "lat", "lon")

> address
              address          city state   zip                lat                lon
1 6540 BENNINGTON AVE   Kansas City    MO 64133 39.005620414000475 -94.50998643299965
2       456 POOH LANE New York City    NY 10025                 40                -90

Note that this is pretty specific to the format of the one entry given. It would need to be tweaked if there is variation in any number of ways. This takes everything from the start of the string to the first [:space:] character as address. The next set of letters, spaces and periods up until the next comma is given to city. After the comma and a space, the next two letters are given to state. Following a space, the next five digits make up the zip field. Finally, the next set of numbers, period and/or minus signs each get assigned to lat and lon.

Cotton.Rockwood
  • 1,601
  • 12
  • 29