4

Let's say I have this string:

<div>john doe is nice guy btw 8240 E. Marblehead Way 92808  is also</div>

or this string:

<div>sky being blue? in the world is true? 024 Brea Mall  Brea, California 92821 jackfroast nipping on the firehead</div>

How would I go about extracting the address from one of these strings? This would involve some sort of Regex, right?

I've tried looking online for a solution using JavaScript or PHP, but to no avail. And no other post here on Stack Overflow (as far as I know) provides a solution that uses jQuery and/or JavaScript and/or PHP. (The closest is Parse usable Street Address, City, State, Zip from a string, which DOESN'T have any code in the thread about extracting a postal code from a string.)

Can somebody point me in the right direction? How would I go about accomplishing this in jQuery or JavaScript or PHP?

Community
  • 1
  • 1
  • 2
    That looks like a case for regular expressions. I still won't help you, because I question your motives. – Philipp Dec 30 '12 at 00:10
  • 3
    @Philipp What motives?!? –  Dec 30 '12 at 00:11
  • 1
    going to need a serious set of regex filters to validate addresses regardless of which language you do it in ...good luck...this won't be trivial! – charlietfl Dec 30 '12 at 00:12
  • 1
    @Philipp Wait what? I need to parse addresses for my reminder service! Here's the URL I'm developing it at! http://dumbsearch.com/now2.php When people enter reminders, I want to detect the address, so that when the date comes it will display the reminder, how many minutes it takes to get there, and a link to Apple Maps. This is a web app for iPhone, but it also works on desktop. Try it! Most of my other questions were related to this! Like look at http://stackoverflow.com/questions/14014619/simplexml-not-returning-anything ! In that question, I'm asking why my MapQuest API call doesn't work. –  Dec 30 '12 at 00:32
  • 1
    @Philipp Why did you delete your comment? –  Dec 30 '12 at 00:51
  • if this is data people enter...is it not coming from an address field(s) in form? And stored as address in DB? If so, wrap it in html element with a class in server code when you output it..and you have absolute data. regex methods are not simple, and would be prone to many formatting problems – charlietfl Dec 30 '12 at 19:16
  • @charlietfl Nope, I let the user enter the reminder information in the textarea. I would like to get the information entered, and if there is an address, to parse it. –  Dec 30 '12 at 22:54
  • Om, a reasonable explanation for the downvotes? –  Dec 31 '12 at 19:26

6 Answers

23

Tried this on twelve different strings that were similar to yours and it worked just fine:

function str_to_address($context) { 

    // split on spaces and reverse, so the zip code is found first
    $context_parts = array_reverse(explode(" ", $context)); 
    $zipKey = 0; // integer default keeps array_slice() valid if no zip is found
    foreach($context_parts as $key=>$str) { 
        if(strlen($str)===5 && is_numeric($str)) { 
            $zipKey = $key;
            break; 
        }
    }

    // drop everything after the zip, then restore the original word order
    $context_parts_cleaned = array_slice($context_parts, $zipKey); 
    $context_parts_normalized = array_reverse($context_parts_cleaned); 
    $houseNumberKey = 0; // same integer default for the house number
    foreach($context_parts_normalized as $key=>$str) { 
        if(strlen($str)>1 && strlen($str)<6 && is_numeric($str)) { 
            $houseNumberKey = $key;
            break; 
        }
    }

    // keep the words from the house number through the zip
    $address_parts = array_slice($context_parts_normalized, $houseNumberKey);
    $string = implode(' ', $address_parts);
    return $string;
}

This assumes a house number of at least two digits and no more than five. It also assumes that the zip code isn't in the "expanded" form (e.g. 12345-6789). However, this can easily be modified to fit that format (regex would be a good option here, something like (\d{5}-\d{4})).
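
As a rough illustration of that modification, here is a minimal JavaScript sketch of a token check that accepts both the plain and the expanded zip form. The helper name isZipToken is made up for this example and is not part of the function above; the same pattern should carry over to PHP's preg_match.

```javascript
// Hypothetical helper (not from the answer): true for "12345" or "12345-6789".
function isZipToken(token) {
  // ^ and $ anchor the pattern so the whole token must be a zip code.
  return /^\d{5}(-\d{4})?$/.test(token);
}

console.log(isZipToken("92808"));      // true
console.log(isZipToken("92808-1234")); // true
console.log(isZipToken("7143138656")); // false
```

Swapping a check like this in for the strlen/is_numeric test would be one way to handle the expanded form, under the assumption that the zip is still a single space-separated token.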

But using regex to parse the whole address out of user-entered data is not a good idea here: we just don't know what a user is going to type, because (as one can assume) the input was never validated.

Walking through the code and logic, starting with creating the array from the context and grabbing the zip:

// split the context (for example, a sentence) into an array, 
// so we can loop through it. 
// we reverse the array, as we're going to grab the zip first. 
// why? we KNOW the zip is 5 characters long*.
$context_parts = array_reverse(explode(" ", $context));  

// we're going to store the array index of the zip code for later use 
// (default to 0, so array_slice() still gets an integer if no zip is found)
$zipKey = 0; 

// foreach iterates over an object given the params, 
// in this case it's like doing... 
// for each value of $context_parts ($str), and each index ($key)
foreach($context_parts as $key=>$str) { 

    // if $str is 5 chars long, and numeric... 
    // an incredibly lazy check for a zip code...
    if(strlen($str)===5 && is_numeric($str)) {  
        $zipKey = $key;

        // we have what we want, so we can leave the loop with break
        break; 
    }
}

Do some tidying so we have a better array to grab the house number from:

// remove junk from $context_parts, since we don't 
// need stuff after the zip
$context_parts_cleaned = array_slice($context_parts, $zipKey); 

// since the house number comes first, let's go back to the start
$context_parts_normalized = array_reverse($context_parts_cleaned);

And then let's grab the house number, using the same basic logic that we used for the zip code:

$houseNumberKey = 0; 
foreach($context_parts_normalized as $key=>$str) { 
    if(strlen($str)>1 && strlen($str)<6 && is_numeric($str)) { 
        $houseNumberKey = $key;
        break; 
    }
}

// we probably have the parts we need for the address.
// let's do some more cleaning 
$address_parts = array_slice($context_parts_normalized, $houseNumberKey);

// and build the string again, from the address
$string = implode(' ', $address_parts);

// and return the string
return $string;
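
Since a JavaScript version was requested in the comments, here is a rough port of the function above. It is a sketch, not part of the original answer: the name strToAddress is made up, and it tests for digit-only tokens with a regex, which is slightly stricter than PHP's is_numeric.

```javascript
// Rough JavaScript port of str_to_address(); same assumptions as the PHP version.
function strToAddress(context) {
  // Assumption: digits only (PHP's is_numeric also accepts values like "12.34").
  const isDigits = (s) => /^\d+$/.test(s);

  // Split on single spaces and search from the end for a 5-digit zip.
  const reversed = context.split(" ").reverse();
  let zipKey = reversed.findIndex((s) => s.length === 5 && isDigits(s));
  if (zipKey === -1) zipKey = 0; // no zip found: keep every word

  // Drop whatever followed the zip, then restore reading order.
  const normalized = reversed.slice(zipKey).reverse();

  // The first token of 2 to 5 digits is treated as the house number.
  let houseKey = normalized.findIndex(
    (s) => s.length > 1 && s.length < 6 && isDigits(s)
  );
  if (houseKey === -1) houseKey = 0; // no house number found: keep every word

  return normalized.slice(houseKey).join(" ");
}

console.log(strToAddress("john doe 7143138656 is 8240 e marblehead way 92808"));
// "8240 e marblehead way 92808"
```

The ten-digit phone number is skipped because it fails the length test for both the zip and the house number.
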
Josh Brody
  • 5,153
  • 1
  • 14
  • 25
  • 4
    WOWW! Thanks for the response! So comprehensive!!! So descriptive! SO GOOOD! (BTW I awarded you the 100 point bounty, so now your reputation is +100 :)) I also marked your answer as correct, and also upvoted it. It works on EVERY test, whether the string has other numbers in it or not! :) –  Jan 02 '13 at 01:06
  • 2
    Thanks again SO MUCH for your WONDERFUL RESPONSE! –  Jan 02 '13 at 05:32
  • 2
    no but you're a genius! I couldn't find a similar script anywhere! –  Jan 02 '13 at 18:41
  • 2
    Sometimes the best solution is often the simplest. :) – Josh Brody Jan 02 '13 at 19:30
  • 1
    :) Yeah well thanks again. :) I'm glad I was able to give you +100 rep. :) –  Jan 02 '13 at 19:38
  • Can you post this code in javascript also? @Josh Brody – Lakshmi Feb 24 '22 at 09:01
2

Regular expressions are used to test against patterns. You need to know what pattern you're looking for. From the two examples you provided, I would look for a number, then some text, ending with a five digit number.

All the addresses would have to be in this format. You can't magically just extract addresses from a string.
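
For what it's worth, the pattern described above (a number, then some text, ending with a five-digit number) could be sketched in JavaScript roughly like this. The names ADDRESS_PATTERN and findAddress are made up for the example, and the 1 to 5 digit limit on the leading number is an assumption, not something the two sample strings prove.

```javascript
// Sketch only: a house number of 1-5 digits, any text, then a 5-digit zip.
// \b keeps the digits from matching inside a longer number such as a phone number.
// .*? is lazy, so the match stops at the first 5-digit number it reaches.
const ADDRESS_PATTERN = /\b\d{1,5}\b.*?\b\d{5}\b/;

function findAddress(text) {
  const match = text.match(ADDRESS_PATTERN);
  return match ? match[0] : null; // null when nothing fits the pattern
}

console.log(findAddress("<div>john doe is nice guy btw 8240 E. Marblehead Way 92808  is also</div>"));
// "8240 E. Marblehead Way 92808"
console.log(findAddress("no address in here"));
// null
```

This only works while every address really does follow that format; any other stray number in the text before the address can still throw it off.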

ehsangh
  • 311
  • 2
  • 6
  • 16
  • ... But can someone provide a sample regular expression that looks for this? (a number, text, ending 5 digit number) –  Dec 30 '12 at 00:17
  • This thread: http://stackoverflow.com/questions/16413/parse-usable-street-address-city-state-zip-from-a-string provides some good pointers of which matches to find... but I'd like to have a sample regex code of this in action... Thanks! (PS you get my upvote) –  Dec 30 '12 at 00:19
2

If all your addresses start and end with numbers, you can use this regular expression to extract the data you need:

/[0-9].+[0-9]/gi

JavaScript example:

"<div>john doe is nice guy btw 8240 E. Marblehead Way 92808  is also</div>".match(/[0-9].+[0-9]/gi) // ["8240 E. Marblehead Way 92808"]
"<div>sky being blue? in the world is true? 024 Brea Mall  Brea, California 92821 jackfroast nipping on the firehead</div>".match(/[0-9].+[0-9]/gi) // ["024 Brea Mall  Brea, California 92821"]

For the new example, which contains a phone number, you can do:

/[0-9].*[0-9]/gi

JavaScript example:

"john doe 7143138656 is 8240 e marblehead way 92808".match(/[0-9].*[0-9]/gi) // ["7143138656 is 8240 e marblehead way 92808"]

Note that the match runs from the first digit to the last, so the phone number is included. This will only help you if you have one match per line. If you really need a powerful address matcher, you will need to go further and build a more thorough analysis.

You can begin by searching the text for target keywords, then filter the paragraph, and then strip out the info you are looking for.

It's not an easy question, but it can be done. You can use more than one regexp for some matches, but if the addresses don't follow a pattern, regexps will be useless and you will need to change your approach.

Gabriel Gartz
  • 2,840
  • 22
  • 24
  • Thanks, but this wouldn't work as there would also be phone numbers in the string... :( –  Dec 30 '12 at 17:19
  • Please provide an example. You can also change the + to * to match up to the last numeric value, so you will get everything between the numbers on the line too. – Gabriel Gartz Jan 01 '13 at 17:16
  • 1
    Thanks so much for your help, BTW. :) Here's an example: "john doe 7143138656 is 8240 e marblehead way 92808" –  Jan 01 '13 at 17:54
1

It is a common "mistake" to try to parse everything with regular expressions because it is convenient. However, regular expressions are not the answer to everything. In this case it doesn't look like you are looking for regular patterns in text, but rather "natural" expressions someone would write as if they were talking to you. These natural expressions won't necessarily follow any consistent pattern at all. Some people put apartment numbers first and then the building number, some people leave out the city and skip to the zip code, some people might put city, state, country THEN zip. It just won't be possible to enumerate every possible regex pattern that someone could cook up for an address.

For natural language addresses I would forget regex address detection and move towards a stateful parsing algorithm.

  1. I would start by reading the text from left to right (at least in English) one word at a time. At each word you would do one logical test: "could this word be the start of an address?". I would suppose this is a number for either a building number or an apt/unit/box number (so "Box XXX", "PO BOX XXX", "PO XXX", "Unit XXX", "#XXX" or any number less than 6 digits in length). While I don't know this to be factually true, I've never seen a North American building number 7 digits in length, which is the minimum for a phone number. So I would suspect you could sort out phone numbers vs building numbers fairly easily. This "start of address" test could be a set of regex matches, but we're not matching the whole address, just testing for words or phrases that start an address. I'd probably even say it'd be simpler without regex matching.

  2. Once you've detected the start of an address, you create an "address parsing state object" (some class you use to hold the address as you continue parsing and to keep track of what you have so far and what you expect next). Now you can continue stepping through the sentence and adding to your parser state object. Following a building number, I'd probably expect a street name or a directional indicator (N. E. W. S. NE. NW. SE. SW.). If neither of those comes next, stop your address parsing, assume an invalid or incomplete address, and keep looking for new start-of-address words. Otherwise add the street name and/or directional indicators to your parser state object and keep going!

  3. Anything following a street name could be infinitely variable. Some users may just stop at building number and street name (assuming their local city/region/country). Otherwise you are probably looking for either a city name or a postal code/zip code. If found, add to your address parsing state object, if not assume an incomplete address (fill with user default location info?) or invalid address (ignore and continue looking for another start of address?).

Ultimately this approach could be one fairly simple JavaScript method with maybe a couple hundred lines of code (I'm not a PHP guy, but I assume it'd be similar). If you were to try to enumerate every possible regex pattern someone could construct an address with, you'd have hundreds of those alone and it'd still be unreliable! (Probably slow too, if you are trying to match hundreds of regex patterns.)
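
To make the three steps a bit more concrete, here is a heavily simplified JavaScript sketch of the word-by-word idea. Everything in it is an assumption made for illustration: the names (parseAddress, isBuildingNumber and so on), the rule that only a bare 1 to 5 digit number starts an address (no "PO Box" or "Unit" handling), and the rule that a 5-digit number ends it.

```javascript
// Hypothetical sketch of a word-by-word address parser; names and rules are illustrative.
const DIRECTIONS = new Set(["N", "E", "W", "S", "NE", "NW", "SE", "SW"]);

const isBuildingNumber = (w) => /^\d{1,5}$/.test(w); // under 6 digits, so phone numbers are skipped
const isZip = (w) => /^\d{5}$/.test(w);
const isDirection = (w) => DIRECTIONS.has(w.replace(/\./g, "").toUpperCase());
const isNameWord = (w) => /^[A-Za-z][A-Za-z.'-]*,?$/.test(w); // letters, optional trailing comma

function parseAddress(text) {
  const words = text.split(/\s+/).filter(Boolean);
  let state = null; // the "address parsing state object"

  for (const word of words) {
    if (state !== null) {
      // Step 3: a zip after at least one street word completes the address.
      if (isZip(word) && state.words.length > 0) {
        state.zip = word;
        return state;
      }
      // Step 2: an optional direction right after the number, then name words.
      if (state.words.length === 0 && state.direction === null && isDirection(word)) {
        state.direction = word;
        continue;
      }
      if (isNameWord(word)) {
        state.words.push(word);
        continue;
      }
      state = null; // unexpected word: abandon, then retest it as a possible start
    }
    // Step 1: could this word be the start of an address?
    if (isBuildingNumber(word)) {
      state = { number: word, direction: null, words: [], zip: null };
    }
  }
  // Ran out of text: a number plus street words is reported as incomplete (zip stays null).
  return state !== null && state.words.length > 0 ? state : null;
}

console.log(parseAddress("john doe 7143138656 is 8240 e marblehead way 92808"));
```

With these rules the phone number never starts an address because it is longer than five digits. A real version would need a much richer step 1 and some way to decide where the street name ends when no zip follows.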

BenSwayne
  • 16,810
  • 3
  • 58
  • 75
  • Thanks for the response! :) Is there any way that you could point me to a premade PHP parsing algorithm best for this? Or could you code me a basic JavaScript example? Thanks. :) –  Jan 01 '13 at 22:40
  • @DumbProducts Glad this is helpful. I think the nature of this site is more about helping you do it yourself with a little strategic help/guidance. If you want it just written for you in code, feel free to click through my profile to find me at my day job and purchase some consulting time. It would be a few hours work which I'm not going to do for free on here. I gotta eat too. :-) – BenSwayne Jan 01 '13 at 23:09
0

I've had the best luck using the Google Geocode API. It takes away the difficulty of trying to anticipate every possible way an address string may be entered.

I recently had to extract parts of an address from a single string for a real estate website, and I found that the best option was to use the Google Geocode API. It allowed me to get Street, City, State, Zip, Latitude, Longitude, and more for every address entered.

I found a great guide on getting set up with google geocode API (PHP) here: http://www.andrew-kirkpatrick.com/2011/10/google-geocoding-api-with-php/

The best part: it even works with names of places. So a search for 'UCLA' or 'Apple Headquarters' will give you all the parts of an address that you might need.
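
As a rough idea of what working with the geocoding response looks like, here is a small JavaScript sketch that builds a request URL and picks fields out of one result. It makes no network call. The function names, the YOUR_API_KEY placeholder, and the hand-written sampleResult object (with placeholder coordinates) are all assumptions for illustration; only the endpoint and the address_components / geometry field names come from the API's JSON format.

```javascript
// Assumption: YOUR_API_KEY stands in for a real Google Maps Platform key.
function buildGeocodeUrl(address, apiKey) {
  const params = new URLSearchParams({ address, key: apiKey });
  return `https://maps.googleapis.com/maps/api/geocode/json?${params}`;
}

// Pull a few fields out of one entry of the response's "results" array.
function summarizeResult(result) {
  const byType = (type) => {
    const part = result.address_components.find((c) => c.types.includes(type));
    return part ? part.long_name : null; // null when the component is missing
  };
  return {
    streetNumber: byType("street_number"),
    street: byType("route"),
    city: byType("locality"),
    state: byType("administrative_area_level_1"),
    zip: byType("postal_code"),
    lat: result.geometry.location.lat,
    lng: result.geometry.location.lng,
  };
}

// Hand-written, trimmed sample in the shape of one result; not real lookup data.
const sampleResult = {
  address_components: [
    { long_name: "8240", short_name: "8240", types: ["street_number"] },
    { long_name: "E. Marblehead Way", short_name: "E. Marblehead Way", types: ["route"] },
    { long_name: "92808", short_name: "92808", types: ["postal_code"] },
  ],
  geometry: { location: { lat: 0, lng: 0 } }, // placeholder coordinates
};

console.log(buildGeocodeUrl("8240 E. Marblehead Way 92808", "YOUR_API_KEY"));
console.log(summarizeResult(sampleResult));
```

In practice you would fetch the URL, check the response's status field, and pass results[0] to something like summarizeResult.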

Bayo
  • 251
  • 2
  • 4
-1

My thinking is that you should have something to tell your code that 'from here to here is an address and the rest is simple text'. For that, either build an array of addresses or keep the addresses in a database, so you can compare them with the entered values.

Roger
  • 1,693
  • 1
  • 18
  • 34