Cleaning Street Addresses in Text Mining

Question

I'm looking for a way to remove street addresses from the text I currently have. Is there a regular expression that can detect the text that lies between two numbers? My thinking is that an address usually starts with a street number and ends with a zip code, for example:

1234 Parks St., Los Angeles, CA 90001

My main issue is that I want to remove the street name from my dataset while I do my other cleaning and look for other words within my set.

I am using RStudio to do the cleaning.

If your addresses are not virtually identical in terms of their structure, geocoding might be your best bet: http://stackoverflow.com/questions/16413/parse-usable-street-address-city-state-zip-from-a-string — majom, Aug 15 '16 at 21:18
That's what I would like to aim for. Each line of text has a mix of different information in it. Sometimes it's just a summary of a conversation. Sometimes it's an action such as "John will be contacted at 1234 Parks St., Los Angeles, CA 90001." — JohnnyJup, Aug 15 '16 at 21:31

IRTFM · Accepted Answer · 2016-08-16T07:30:52.053

Assuming test is a character vector of address strings, this returns a character vector. Read the regex as three capture groups, marked by the parentheses: the first is any run of consecutive digits (possibly empty), the second is any run of non-digits, and the third is exactly 5 digits. If there is a match, only the first and third groups are returned, with a space in between; if there is no match, the string is left unchanged.

> gsub("([0-9]*)(\\D*)(\\d{5})", "\\1 \\3", test)
[1] "1234 90001" "9876 94501"
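
The answer does not show how test was defined. As a minimal sketch for reproducing the call above, assume it is a character vector along these lines; the first string is the example from the question, while the second is a made-up address chosen only so that the output matches the one shown:

```r
# Hypothetical input: the first address comes from the question,
# the second is invented for illustration.
test <- c("1234 Parks St., Los Angeles, CA 90001",
          "9876 Main St., Alameda, CA 94501")

# Group 1: leading digits; group 2: run of non-digits; group 3: 5-digit ZIP.
# The replacement keeps groups 1 and 3, separated by a space.
res <- gsub("([0-9]*)(\\D*)(\\d{5})", "\\1 \\3", test)
print(res)
```

Because the middle group only accepts non-digits, an address that contains other digits between the street number and the ZIP (for example "5th Ave" or a unit number) would probably not be reduced this cleanly.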

It needs further parsing to split the result into separate vectors, one for the street numbers and one for the zips:

> scan( text=gsub("([0-9]*)(\\D*)(\\d{5})", "\\1 \\3", test), what=list("", "") )
Read 2 records
[[1]]
[1] "1234" "9876"

[[2]]
[1] "90001" "94501"

It is probably better to read the zips as character (because you will want to preserve leading zeros), but the street numbers can be converted to numeric by changing the types in the what list:

> scan( text=gsub("([0-9]*)(\\D*)(\\d{5})", "\\1 \\3", test), what=list( numeric(), "") )
Read 2 records
[[1]]
[1] 1234 9876

[[2]]
[1] "90001" "94501"
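
The point about leading zeros can be illustrated with a short sketch. The ZIP value here is only an illustrative assumption and is not part of the original data:

```r
# Illustrative ZIP code that starts with a zero
zip_chr <- "02134"

# Converting to numeric silently drops the leading zero
zip_num <- as.numeric(zip_chr)
print(zip_num)

# The zero can be restored by padding to width 5, but only because
# we know that a ZIP always has five digits
zip_back <- sprintf("%05d", as.integer(zip_num))
print(zip_back)
```

Keeping the column as character from the start avoids the round trip entirely.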

To make this more useful, wrap the result in a data frame with named columns:

> setNames( data.frame( scan( text=gsub("([0-9]*)(\\D*)(\\d{5})", "\\1 \\3", test), 
                              what=list( numeric(), "") ) , 
                       stringsAsFactors=FALSE), 
            c( "StrtNumber", "ZIP") )
Read 2 records
  StrtNumber   ZIP
1       1234 90001
2       9876 94501
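
The question asks about removing the address from free text rather than extracting the numbers. The same idea might be adapted as sketched below. This is not part of the original answer: the notes vector and the address_pattern name are hypothetical, and the pattern assumes that every address starts with a street number, ends with a 5-digit ZIP, and has no other digits in between.

```r
# Hypothetical free-text notes; the first is the example from the comments
notes <- c("John will be contacted at 1234 Parks St., Los Angeles, CA 90001.",
           "Summary of a conversation with no address in it.")

# Assumed address shape: optional leading whitespace, a street number,
# a run of non-digits, then a 5-digit ZIP.
# \\b stops the match from starting or ending inside a longer number.
address_pattern <- "\\s*\\b[0-9]+\\D+[0-9]{5}\\b"

# Delete every match; strings without a match are returned unchanged
cleaned <- gsub(address_pattern, "", notes, perl = TRUE)
print(cleaned)
```

This is deliberately crude: any other sentence that has a number followed later by a five-digit number would also be cut, and addresses with digits in the street name would be missed. For text as mixed as described in the comments, the address parsing or geocoding approach suggested there is likely to be more reliable.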
