Parsing HTML tables via DOM

Question

I believe the markup of the page is part of the issue I am having, so I am posting the source, a JSFiddle, and the original GIS page.

I am trying to get info such as Name: and Address: from the table at the bottom.

Attempt at a solution:

I wrote the following code hoping to see all the table data, yet the table I'm looking to get data from returns nothing.

 <?php
 $k = 0;
 $num = 1000;
 var_dump(libxml_use_internal_errors(true));
 $domOb = new DOMDocument();
 $domOb->preserveWhiteSpace = false; // set before loading; it is a parse-time option
 $html = @$domOb->loadHTMLFile('http://www.gis.catawba.nc.us/website/Parcel/parcel_main.asp?Cmd=query&key=372215634301&type=P');
 $items = $domOb->getElementsByTagName('td');
 // Stop at the end of the node list: item() returns null past the last index.
 while ($k < (int)$num && $k < $items->length) {
     echo $items->item($k++)->nodeValue . '<br>';
 }
 ?>

All that was returned was:

bool(false) Real Estate Search - Legacy Map Layers visible FAQ's Help GIS Home

Can someone tell me what I'm doing wrong that makes me miss all the data I'm looking for? How can I pull just the name and address as simply as possible?

I also attempted the following using XPath, but I get lots of warnings:

 $dom = new DOMDocument;
 $dom->load('http://www.gis.catawba.nc.us/website/Parcel/parcel_main.asp?Cmd=query&key=372215634301&type=P');
 $s = simplexml_import_dom($dom);

 echo $name = $s->xpath('//table[@class="words13"]/td[contains(text(), "Name:")]');
 echo $add = $s->xpath('//table[@class="words13"]/td[contains(text(), "Address:")]');

Combining the code by user2518542 with hakre's code, I get the following:

 $ch = curl_init();
 curl_setopt($ch, CURLOPT_URL, "http://www.gis.catawba.nc.us/website/Parcel/parcel_main.asp?Cmd=QUERY&key=372215634301&type=P&width=1280&height=923");
 curl_setopt($ch, CURLOPT_TIMEOUT, 30); // time out after 30 seconds
 curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body instead of printing it
 $result = curl_exec($ch);
 curl_close($ch);

 $doc = new DOMDocument(); // $doc must exist before loadHTML() is called
 libxml_use_internal_errors(true); // collect markup warnings instead of printing them
 $doc->loadHTML($result);

 $tds = $doc->getElementsByTagName('td');
 foreach ($tds as $td) {
     printf(" * %s\n", $td->textContent);
     echo '<br>';
 }

The code above successfully prints the contents of all the td tags.

The problem is that the td I need does not even show up when I cycle through all td elements — tyler, Jun 25 '13 at 04:21

hakre · Accepted Answer · 2013-06-25T05:09:58.257

The table cells you are looking for are not part of that HTML document. First of all, you need to understand the basics of web scraping; I suggest you borrow some books about the topic and read through them.

Time for the library ;)

When the table cells are in the document (it seems to vary: sometimes they are, sometimes they are not), the example below prints them from a saved copy of the page. It also demonstrates how to iterate over a DOMNodeList:

$doc = new DOMDocument();

libxml_use_internal_errors(true);
$doc->loadHTMLFile('Catawba County Legacy Map Server.html');

$tds = $doc->getElementsByTagname('td');
foreach($tds as $td) {
    printf(" * %s\n", $td->textContent);
}

Example output:

php "test.php" (in directory: /home/hakre/php/test)
 *
 * Real Estate Search - Legacy
 *
 *
 *
 *
 *
 *
 *
 *
 *
 * Map Layers
 * visible
 *
 *
 * Parcels
 *
 * Parcel Annotation
 *
 * Address Points
 *
 * Misc. Lines
 *
 * Structures
 *
 * Contour Lines
 *
 * Soils
 *
 * Townships
 *
 * Water Features
 *
 * Tiles
 *
 * Flood Zone
 *
 * Agricultural District
 *
 * Aerial 2009
 *
 * Aerial 2005
 *
 * Aerial 2002
 *
 * Cities
 *
 * Print the Map  
 * Print Map and Parcel Report  
 * Print the Parcel Report  
 * Assessment Report  
 * List all Owners  
 * Deed History Report
 * Parcel Information:
 * Owner Information:
 * Parcel ID: 372215634301
 * Name: PENLEY TREASURE B
 * Parcel Address: 3152 7TH AV SE 
 * Name2:  
 * City: CONOVER 28613
 * Address: 5508 SWINGING BRIDGE RD
 * LRK(REID): 57186
 * Address2:  
 * Deed Book/Page: 1906/0741 Deed Image
 * City: CONOVER
 * Subdivision: FOREST HGTS
 * State/Zip: NC 28613-7415
 * Lots: 1-4
 *
 * Block: C
 *
 * Last Sale:
 * School Information:
 * Plat Book/Page: 8/119 Plat Image
 * School District: COUNTY
 * Calculated Acreage: 0.31
 * Elementary School: WEBB A MURRAY
 * Tax Map: 167H  04006A
 * Middle School: ARNDT
 * State Road:  
 * High School: ST STEPHENS
 * Township: HICKORY
 * School Map
 *  
 *  
 * Tax/Value Information:  Tax Rates(pdf)
 * Zoning Information:
 * Municipal Tax District:  
 * Zoning District: HICKORY
 * Fire District: HICKORY RURAL
 * Zoning1: OI
 * Tax Account Number:  
 * Zoning2:  
 * Market Building(s) Value: $55,400
 * Zoning3:  
 * Market Land Value: $20,300
 * Zoning Overlay:  
 * Market Total Value: $75,700
 * Small Area:  
 * Year Built/Remodeled: 1959  
 * Split Zoning District 1/2: 0/0
 * Current Tax Bill
 * Zoning Agency Phone Numbers
 * Miscellaneous:
 *  
 * Voter Precinct:P35
 * Firm Panel Date: 9/5/2007
 * Building Permits for this parcel
 * Firm Panel #: 3710372200J
 * WaterShed:  
 * 2010 Census Tract: 011000
 * WaterShed Split:  
 * 2010 Census Block: 3035
 * Parcel Report Data Descriptions
 * Agricultural District:  
 * FAQ's
 * Help
 * GIS Home
Compilation finished successfully.
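
To pull only the name and address out of cells like the ones listed above, one option is to split each cell's text at the first colon and collect the results in an array. This is only a minimal sketch: the `$html` sample and the function name `extractLabeledCells` are hypothetical, and it assumes each cell holds a single "Label: value" pair and that the cells are actually present in the document you received.

```php
<?php
// Hypothetical sample markup standing in for the saved page;
// the real page's structure may differ.
$html = <<<HTML
<table>
  <tr><td>Parcel ID: 372215634301</td><td>Name: PENLEY TREASURE B</td></tr>
  <tr><td>Parcel Address: 3152 7TH AV SE</td><td>Address: 5508 SWINGING BRIDGE RD</td></tr>
</table>
HTML;

/**
 * Collect "Label: value" cells into an associative array.
 * If a label occurs more than once, the first occurrence wins.
 */
function extractLabeledCells(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence markup warnings
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = [];
    foreach ($doc->getElementsByTagName('td') as $td) {
        $text = trim($td->textContent);
        $pos = strpos($text, ':');
        if ($pos === false) {
            continue; // not a labeled cell
        }
        $label = trim(substr($text, 0, $pos));
        $value = trim(substr($text, $pos + 1));
        if ($label !== '' && !isset($fields[$label])) {
            $fields[$label] = $value;
        }
    }
    return $fields;
}

$fields = extractLabeledCells($html);
echo $fields['Name'] ?? '(Name cell not found)', "\n";
echo $fields['Address'] ?? '(Address cell not found)', "\n";
```

Because the lookup is by exact label, "Parcel Address" and "Address" stay separate. The `??` fallbacks cover the case where the server's response does not include the cells at all.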

So to get the needed name and address you suggest XPath as well? — tyler, Jun 25 '13 at 04:34
with something called `foreach` but I'm actually seeing you do it; I thought you would have made the same mistake as earlier where you set $num to 1. Btw, you do everything right, it's just that the document does not have any more cells. — hakre, Jun 25 '13 at 04:44
anymore cells... so where did the fields I'm looking for go? — tyler, Jun 25 '13 at 04:57
As written, time to read some books boy. It's no miracle, only the power of understanding. What you learn by that is also useful for other angles of webdevelopment, so don't fear the work it takes to do learning. — hakre, Jun 25 '13 at 04:59
So I'm right to assume that when I parse the file, the table is not there? How can a table not be part of the HTML, and is what I'm wanting to do still possible? — tyler, Jun 25 '13 at 05:00
As written, time to study the matters. If you want to do the shortcut, read some books, they work well in transferring knowledge into your thinking in a compact way. Don't ask what is possible, instead get into the know how things work and make things possible! — hakre, Jun 25 '13 at 05:01
You should just post that last comment on every question on this site; I'm sure it applies... Sharing knowledge is the purpose of this site... the idea that someone else might find this page and have the same problem... I'm sure after reading your comment they will have all the answers they need. — tyler, Jun 25 '13 at 05:10
Well, your code *does* already iterate over all TD elements. There is nothing more it can do, nor anything we can add to it. See my last edit. Even though my code is different, it technically does the same. It's just that the website refuses to give you all the data from time to time, so that is independent of your code. This understanding is crucial, but you should not base it on this little information alone; base it on understanding how client and server work in hypertext. That can't be put into a single answer. — hakre, Jun 25 '13 at 05:12
When I run your iteration I do not get the same example output as you. So loadHTMLFile() only pulls data from the front end and not injected data from the back end? — tyler, Jun 25 '13 at 05:24

score 1 · Answer 2 · answered Jun 25 '13 at 04:00

Use XPath to look for //table[@class="words13"]//td[contains(text(), 'Name:')] and //table[@class="words13"]//td[contains(text(), 'Address:')]
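
A full example might look roughly like the sketch below, which uses DOMXPath directly on the DOMDocument. The sample markup and the helper name `firstCellValue` are hypothetical, and the class name `words13` is only assumed to sit on the table as in the expressions above. The sketch swaps `contains()` for `starts-with()` on the normalized cell text so that a search for "Address:" does not also match "Parcel Address:".

```php
<?php
// Hypothetical sample; only the class name "words13" comes from the
// expressions above, the rest of the markup is assumed.
$html = <<<HTML
<table class="words13">
  <tr><td>Name: PENLEY TREASURE B</td><td>Name2: </td></tr>
  <tr><td>Parcel Address: 3152 7TH AV SE</td><td>Address: 5508 SWINGING BRIDGE RD</td></tr>
</table>
HTML;

/**
 * Return the text after $label in the first matching cell, or null.
 * $label must not contain a double quote, since it is placed inside
 * a double-quoted XPath string literal.
 */
function firstCellValue(DOMXPath $xpath, string $label): ?string
{
    $query = sprintf(
        '//table[@class="words13"]//td[starts-with(normalize-space(.), "%s")]',
        $label
    );
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null; // cell not present in this document
    }
    $text = trim($nodes->item(0)->textContent);
    return trim(substr($text, strlen($label)));
}

$doc = new DOMDocument();
libxml_use_internal_errors(true); // silence markup warnings
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$name = firstCellValue($xpath, 'Name:');
$address = firstCellValue($xpath, 'Address:');

echo $name ?? '(Name cell not found)', "\n";
echo $address ?? '(Address cell not found)', "\n";
```

The descendant step `//td` is used because td elements sit inside tr elements rather than directly under the table.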

000

What exactly would a full example look like? `echo $domOb->xpath('//table[@class="words13]/td[contains(text(), 'Name:')]');` Is this your suggestion? – tyler Jun 25 '13 at 04:20
@tman: If you spot a new method, look it up in the manual to get examples. Joe Franbach here suggests that you import the document element into SimpleXML http://php.net/simplexml_import_dom and then use the SimpleXMLElement::xpath() method http://php.net/simplexmlelement.xpath . – hakre Jun 25 '13 at 04:30
I updated my question to reflect my attempt at your solution. I got nothing but lots of warnings. Any idea? – tyler Jun 25 '13 at 04:56

score 1 · Answer 3 · answered Jun 25 '13 at 04:00

Try this

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.gis.catawba.nc.us/website/Parcel/parcel_main.asp?Cmd=QUERY&key=372215634301&type=P&width=1280&height=923");
curl_setopt($ch, CURLOPT_TIMEOUT, 30); // time out after 30 seconds
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body instead of printing it
$result = curl_exec($ch);
curl_close($ch);
echo $result;
exit;

You will get the full page source, and then you can simply extract whatever you want with a regular expression (the preg_* functions).
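
As a rough sketch of that regex step, the snippet below runs preg_match against a hypothetical piece of markup instead of the live page. The `$result` sample and the helper name `matchLabel` are assumptions for illustration; the pattern expects the label to come right after a tag's closing `>` and the value to end at the next tag.

```php
<?php
// Hypothetical stand-in for the page source returned by curl_exec().
$result = '<td class="words13">Name: PENLEY TREASURE B</td>'
        . '<td>Parcel Address: 3152 7TH AV SE</td>'
        . '<td>Address: 5508 SWINGING BRIDGE RD</td>';

/**
 * Return the text between "$label:" and the next tag, or null if
 * the label does not appear directly after a tag's closing ">".
 */
function matchLabel(string $source, string $label): ?string
{
    $pattern = '@>\s*' . preg_quote($label, '@') . ':([^<>]+)<@is';
    if (preg_match($pattern, $source, $m) === 1) {
        return trim($m[1]);
    }
    return null;
}

echo matchLabel($result, 'Name') ?? '(Name not found)', "\n";
echo matchLabel($result, 'Address') ?? '(Address not found)', "\n";
```

Anchoring on the preceding `>` keeps "Address" from matching inside "Parcel Address". A regex like this is brittle against markup changes, so a DOM-based approach is usually easier to maintain.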

user2518542

Could you elaborate on how you would use preg_replace to get the name and address? – tyler Jun 25 '13 at 04:18
to get name preg_match("@>Name:([^<>]+)<@is",$result,$name); – user2518542 Jun 25 '13 at 04:23
to get address preg_match("@>Address:([^<>]+)<@is",$result,$address); – user2518542 Jun 25 '13 at 04:24
This did work; however, I want a different type of solution. Thanks though, I do have a backup plan now! – tyler Jun 25 '13 at 04:27
