My Problem

I'm attempting to crawl the individual links on the US House of Representatives Site to find Washington addresses for all of the listed individuals. The problem is that the format of the Washington address varies from time to time. Sometimes there are bullets, pipes, new lines and break-tags making it difficult to match.


I'm attempting to crawl many pages to retrieve addresses which are largely similar:

Ignore the peculiar whitespace; it is only there to show the similarities between the strings.

    1433 Longworth House Office Building Washington,  D.C. 20515
     332 Cannon HOB                      Washington   DC   20515
    1641 LONGWORTH HOUSE OFFICE BUILDING WASHINGTON,  DC   20515
    1238 Cannon H.O.B. (line return)
    Washington, DC 20515
    8293 Longworth House Office Building • Washington DC • 20515
    8293 Longworth House Office Building | Washington DC | 20515

Each of these comes back surrounded by lots of other text and HTML tags. An address may even contain a <br> or <br/> tag in the middle.

What I would like to do is capture the first match from the source string and set it as the value of a variable. From my understanding, this is best approached with a regular expression.

Update:

After learning more about the various ways in which these addresses can appear, I've decided that a less strict expression would be best. The addresses have been showing up with bullets, pipes, and newlines. Perhaps an expression that communicates the following would be best:

[numbers][anything]["washington"][anything][DC|D.C.][anything][five numbers]

Apparently that is way too loose. The anything blocks were bringing in paragraphs, when I'm merely interested in allowing a few chars of anything.

So far I've been unsuccessful at matching the addresses found on the following (these are just a few of the many)

Sampson
  • Difficult since everyone seems to have their addresses formatted completely differently. I think the best bet would be to first strip all the HTML tags from your input and then apply the regex mentioned below in my answer. That should work better. I don't know PHP, so I can't tell you how to strip HTML tags, but this has surely been answered on SO before. – Tim Pietzcker Dec 26 '09 at 09:33
  • Certainly not an answer, but a bit of reading that might interest you since you're experiencing address problems firsthand: http://www.mjt.me.uk/posts/falsehoods-programmers-believe-about-addresses/ – joequincy Jul 03 '13 at 17:22

5 Answers

EDIT: It appears as though the [anything] data in between the first set of numbers and 'washington' has to be a little more restrictive to work properly. The [anything] section should not contain any numbers, as, well, numbers are what we use to delimit the start of one of the addresses. This works for the three websites you gave us.

I'd say the best first step would be to strip out all HTML tags and replace the `&nbsp;` character entity with a plain space:

$input = strip_tags($input);
$input = preg_replace("/&nbsp;/"," ",$input);

then if the addresses match (close to) the format you specified, do:

$results = array();
// preg_match_all collects every address; $results[0] holds the full matches.
preg_match_all("/[0-9]+\s+[^0-9]*?\s+washington,?\s*D\.?C\.?[^0-9]+[0-9]{5}/si", $input, $results);
foreach ($results[0] as $addr) {
    echo "$addr<br/>";
}

This works for the three examples you provided, and $results[0] should contain each of the addresses found.

However, this won't work, for instance, if the address has an 'Apartment #2' or the like in it, because it assumes that the numbers closest to 'Washington, DC' mark the start of the address.

The following script matches each of the test cases:

<?php
    $input = "
        1433&nbsp;Longworth House Office Building Washington,  D.C. 20515
         332 Cannon HOB                      Washington   DC   20515
        1641 LONGWORTH HOUSE OFFICE BUILDING WASHINGTON,  DC   20515
        1238 Cannon H.O.B.
        Washington, DC 20515
        8293 Longworth House Office Building • Washington DC • 20515
        8293 Longworth House Office Building | Washington DC | 20515
    ";
    $input = strip_tags($input);
    $input = preg_replace("/&nbsp;/"," ",$input);

    $results= array();
    preg_match_all("/[0-9]+\s+[^0-9]*?washington,?\s*D\.?C\.?[^0-9]*?[0-9]{5}/si",$input,$results);
    foreach($results[0] as $addr){
        echo "$addr<br/>";
    }
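
One caveat with stripping tags first: strip_tags() removes a <br> without leaving a space behind, so the building name and "Washington" can end up fused together. Below is a minimal sketch of a preprocessing step that avoids this. The function name normalize_address_text is made up for illustration, and the sketch assumes the crawled page is valid UTF-8.

```php
<?php
// Hypothetical helper (not part of the answer above): turn crawled HTML
// into plain text with single spaces, so an address ends up on one line.
// Assumes $html is valid UTF-8.
function normalize_address_text(string $html): string
{
    // Replace <br>, <br/> and <br /> with a space so adjacent words do not fuse.
    $text = preg_replace('/<br\s*\/?>/i', ' ', $html);
    // Remove the remaining tags.
    $text = strip_tags($text);
    // Decode entities such as &nbsp; and &bull; into UTF-8 characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse whitespace runs, including the decoded no-break space, into one space.
    $text = preg_replace('/[\s\x{00A0}]+/u', ' ', $text);
    return trim($text);
}

echo normalize_address_text("1238&nbsp;Cannon H.O.B.<br/>\nWashington, DC 20515");
```

Decoding entities in general, rather than replacing only `&nbsp;`, also covers pages that write the bullet separator as `&bull;`.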
cmptrgeekken

There are tools and APIs that are built to do this. For example, one that works quite well is LiveAddress by SmartyStreets. I helped develop it, and so I feel some of your pain... Here's the output from the sample you provided in your question:

(screenshot of the LiveAddress output)

Here is the CSV output:

ID,Start,End,Segment,Verified,Candidate,Firm,FirstLine,SecondLine,LastLine,City,State,ZIPCode,County,DpvFootnotes,DeliveryPointBarcode,Active,Vacant,CMRA,MatchCode,Latitude,Longitude,Precision,RDI,RecordType,BuildingDefaultIndicator,CongressionalDistrict,Footnotes
1,4,69,"1433&nbsp;Longworth House Office Building Washington, D.C. 20515",Y,0,,1433 Longworth House Office Building Washington D,,Washington DC 20515-0001,Washington,DC,20515,District of Columbia,AAU1,205150001330,,,,Y,38.89106,-77.01132,Zip5,Residential,S,,AL,Q#X#
2,75,134,332 Cannon HOB Washington DC 20515,Y,0,,332 Cannon Hob,,Washington DC 20515-3226,Washington,DC,20515,District of Columbia,AAU1,205153226996,,,,Y,38.89106,-77.01132,Zip5,Residential,H,Y,AL,H#Q#
3,139,199,"1641 LONGWORTH HOUSE OFFICE BUILDING WASHINGTON, DC 20515",Y,0,,1641 Longworth House Office Building,,Washington DC 20515-0001,Washington,DC,20515,District of Columbia,AAU1,205150001411,,,,Y,38.89106,-77.01132,Zip5,Residential,S,,AL,Q#X#
4,204,247,"1238 Cannon H.O.B.
Washington, DC 20515",Y,0,,1238 Cannon H O B,,Washington DC 20515-0001,Washington,DC,20515,District of Columbia,AAU1,205150001385,,,,Y,38.89106,-77.01132,Zip5,Residential,S,,AL,Q#X#
5,252,316,8293 Longworth House Office Building • Washington DC • 20515,Y,0,,8293 Longworth House Office Building,,Washington DC 20515-0001,Washington,DC,20515,District of Columbia,AAU1,205150001934,,,,Y,38.89106,-77.01132,Zip5,Residential,S,,AL,Q#X#
6,321,381,8293 Longworth House Office Building | Washington DC | 20515,Y,0,,8293 Longworth House Office Building,,Washington DC 20515-0001,Washington,DC,20515,District of Columbia,AAU1,205150001934,,,,Y,38.89106,-77.01132,Zip5,Residential,S,,AL,Q#X#

Took about 2 seconds. This API is free for use up to a point, and there may be others like it; I encourage you to do some looking around to find the option best for you... I guarantee it will be better than writing your own regex (hint: the code-behind of this isn't based on regular expressions).

Matt
  • Does anyone know of any PHP classes, or some other free alternative to LiveAddress? LA works great, but is too costly for the project I am working on. – Owen McAlack Aug 07 '13 at 07:42
  • @pXdty Hm... do you need it for a registered non-profit use? If so, you can get LiveAddress unlimited for free. Otherwise, I'll keep my eye open and let you know if I find a library that does it. – Matt Aug 07 '13 at 14:53
  • @pXdty Can you explain a little bit about the project you are working on? That might help to filter the possible solutions. In summary, it sounds like you want to find a service that can parse through a data source to find, correct, and validate an address (using the most current data from the USPS) and you want the service to be very fast, highly accurate - yet aggressive as well, and at the same time cost you nothing, or very little. Did I accurately sum up what you are looking for? – Jeffrey Aug 07 '13 at 18:06
  • @Jeffrey : We are building a tool to verify local search listings, that will be free use. And yes, we want to be able to parse through large strings that contains addresses and present valid addresses to the user. I currently am using something very dirty that I wrote to do this, but it isn't as reliable as liveaddress. – Owen McAlack Aug 08 '13 at 00:59
  • @Matt : We are writing this for a non-profit. Not sure if they are registered, but I can find out. What do we need to show to get unlimited access? – Owen McAlack Aug 08 '13 at 00:59
  • @pXdty Just sign up with [this form](https://smartystreets.com/free-address-verification) or contact SmartyStreets. (This is off-topic, so contact SS if you have more questions.) You'll just be asked to put up a link and/or tell people about it. – Matt Aug 08 '13 at 01:22

This regex takes a more flexible approach to what the input string can contain: the "Washington, DC" part is not hard-coded into it. The different parts of the address are captured separately, and the whole address is captured in $matches[0].

$input = strip_tags($input);
preg_match('/
(\d++)    # Number (one or more digits) -> $matches[1]
\s++      # Whitespace
([^,]++), # Building + City (everything up until a comma) -> $matches[2]
\s++      # Whitespace
(\S++)    # "DC" part (anything but whitespace) -> $matches[3]
\s++      # Whitespace
(\d++)    # Number (one or more digits) -> $matches[4]
/x', $input, $matches);
Geert
  • This is close, but it assumes there will always be a comma. Please re-evaluate the various formats listed in the original question. – Sampson Dec 26 '09 at 09:14

EDIT:

After looking at the sites you mentioned, I think the following should work. Assuming that you have the contents of the page you crawled in a variable called $page, then you could use

$subject = strip_tags($page);

to remove all HTML markup from the page; then apply the regex

(\d+)\s*(.*?)\s*washington.{0,5}(DC|D\.C\.).{0,5}(\d{5})

RegexBuddy generates the following code for this (I don't know PHP):

if (preg_match('/(\d+)\s*(.*?)\s*washington.{0,5}(DC|D\.C\.).{0,5}(\d{5})/si', $subject, $regs)) {
    $result = $regs[0];
} else {
    $result = "";
}

$regs[1] would then contain the contents of the first capturing parens (numbers), and so forth.

Note the use of the /si modifiers to make the dot match newlines, and to make the regex case-insensitive.
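
To tie this back to the goal in the question (first match into a variable, with only a few characters allowed in each gap), here is a rough sketch. The function name first_washington_address and the 40-character limit on the building name are assumptions made for illustration; the limits may need tuning against the real pages. The sample page text, including the phone number, is invented.

```php
<?php
// Hypothetical helper: returns the first Washington address in $text, or null.
// Assumes $text is valid UTF-8 with the HTML tags already stripped.
function first_washington_address(string $text): ?string
{
    $pattern = '~
        \d+                 # room number
        [^\d<>]{1,40}?      # building name: short, no digits, no tag brackets
        washington
        [^\d]{0,5}?         # at most five separator characters
        D\.?C\.?
        [^\d]{0,5}?         # at most five separator characters
        \d{5}               # ZIP code
    ~xsiu';
    // preg_match stops at the first match; $m[0] is the whole address.
    return preg_match($pattern, $text, $m) === 1 ? $m[0] : null;
}

$page = "<p>Office: 332 Cannon HOB Washington DC 20515 Phone: (202) 225-0000</p>";
$address = first_washington_address(strip_tags($page));
echo $address ?? 'no address found';
```

Excluding digits from the gaps keeps a phone number or an earlier room number from being pulled into the match, and the u modifier makes a bullet separator count as one character instead of three bytes.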

Tim Pietzcker
  • Close, but these "anything" should probably be limited to 5 chars, max. Right now, this regex brings in paragraphs qualified under the [anything] blocks. My fault though, since I was too vague. – Sampson Dec 26 '09 at 08:41
  • No problem, just replace the `.*?` by `.{0,5}` - I edited my answer accordingly. – Tim Pietzcker Dec 26 '09 at 08:53
  • The following doesn't seem to be matching addresses any longer: `/(\d+).{1,5}washington.{1,5}(DC|D.C.).{1,5}(\d{5})/si` – Sampson Dec 26 '09 at 09:09
  • Ah yes, the first "anything" in your examples contains a lot more than 5 characters: ` LONGWORTH HOUSE OFFICE BUILDING `, for example. So I changed that back to `.*?`. If you need to capture the text here, then enclose it in parentheses, like `(.*?)`. – Tim Pietzcker Dec 26 '09 at 09:13
  • Oops, good point. Unfortunately, this is still not matching the address found on http://giffords.house.gov. I currently have: `/(\d+).{1,35}\swashington.{1,5}(DC|D.C.).{1,5}(\d{5})/si` – Sampson Dec 26 '09 at 09:18

Your question isn't very clear to me, but if I understood you correctly, I guess you could use a DOM parser to find the p tags and then check whether any of them contains the word "Washington" or a phone number with the Washington area code.
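
A minimal sketch of the "Washington" check with PHP's built-in DOMDocument follows. The function name paragraphs_mentioning_washington is hypothetical, and the sketch assumes the address really sits inside a p element, which may not hold on every member's page.

```php
<?php
// Hypothetical helper: text of every <p> element that mentions "Washington".
function paragraphs_mentioning_washington(string $html): array
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid HTML, so collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hits = [];
    foreach ($doc->getElementsByTagName('p') as $p) {
        // Case-insensitive substring test on the element's text.
        if (stripos($p->textContent, 'washington') !== false) {
            $hits[] = trim($p->textContent);
        }
    }
    return $hits;
}

$html = '<html><body><p>Contact us</p><p>332 Cannon HOB Washington DC 20515</p></body></html>';
print_r(paragraphs_mentioning_washington($html));
```

This only narrows the page down to candidate paragraphs; a regex would still be needed to cut the address itself out of each candidate.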

Alix Axel