
I have a whole bunch of large HTML documents with tables of data inside, and I'm looking to write a script which can process an HTML file, isolate the table tags and their contents, then concatenate all the rows within those tables into one large data table. After that I want to loop through the rows and columns of the new large table.

After some research I've started trying out PHP's DOMDocument class to parse the HTML but I just wanted to know, is that the best way to do something like this?

This is what I've got so far...

$dom = new DOMDocument();
$dom->preserveWhiteSpace = FALSE;
@$dom->loadHTMLFile('exrate.html');
$tables = $dom->getElementsByTagName('table');

How do I chop out everything other than the tables and their contents? I'd then like to remove the first table, since it's just a table of contents, and loop through all the remaining table rows to build them into one large table.

Anyone got any hints on how to do this? I've been digging through the docs for DOMDocument on php.net but I'm finding the syntax pretty baffling!

Cheers, B

EDIT: Here is a sample of an HTML file with the data tables I'd like to join http://thenetzone.co.uk/exrates/exrate.html

batfastad
  • Well, DOMDocument is horrible. Try phpQuery or QueryPath or one of the other [Best methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html). But my second piece of advice would be to use one of the more simpleminded regex classes to extract rows from HTML tables. – mario Feb 04 '11 at 21:27
  • Can you paste some HTML from your documents? Also, instead of DOMDocument, PHP Simple HTML DOM Parser can be useful for easier coding and for performance issues. [link](http://simplehtmldom.sourceforge.net/manual.htm) – risyasin Feb 04 '11 at 21:28
  • Sorry, I should have specified: the HTML file contains multiple tables with the same columns in the same order, separated by a bunch of text paragraphs between each table. I'll take a look at phpQuery. I'm glad I'm not the only one finding DOMDocument difficult to put together, and I like the idea of using phpQuery or QueryPath, which are wrappers around DOMDocument. There is a sample of the HTML code here: (http://thenetzone.co.uk/exrates/exrate.html) – batfastad Feb 04 '11 at 22:13

1 Answer


OK, got it sorted with phpQuery and lots of trial and error.
It removes the contents table and the header rows, moves the rows of all the remaining tables into the first one, and removes the tables left empty.
It then loops through each table row and extracts the text from specific columns, in this case the 2nd and 3rd td of each row.

require('phpQuery/phpQuery.php');
$doc = phpQuery::newDocumentFileHTML('exrates_code.html');
pq('table:first')->remove();// REMOVE FIRST TABLE, JUST A CONTENTS TABLE SO NOT INTERESTED
pq('tr:has(th)')->remove();// REMOVE TABLE ROWS THAT ARE HEADERS
pq('table:not(:first) tr')->appendTo('table:first');// MOVE CONTENTS OF OTHER TABLES TO FIRST
pq('table:empty')->remove();// REMOVE EMPTY TABLES
pq('br')->remove();// REMOVE LINE BREAKS

$rows = pq('table tr');
foreach ($rows as $row) {
    // :eq() IS ZERO-BASED, SO THESE ARE THE 2ND AND 3RD CELLS OF EACH ROW
    $currency = pq($row)->find('td:eq(1)')->text();
    $value = pq($row)->find('td:eq(2)')->text();
}
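
For anyone who'd rather stay with the built-in DOMDocument that the question started with, here is a rough sketch of the same steps using DOMDocument and DOMXPath instead of phpQuery. The function name collectTableRows, the column map and the sample markup (including its values) are made up for illustration. It assumes the tables are not nested, that header rows use th cells, and that the first table is the one to skip.

```php
<?php
// Sketch only: collectTableRows() is a hypothetical helper, not part of any library.
// $columns maps an output key to a zero-based td position within each row.
function collectTableRows(string $html, array $columns): array
{
    $dom = new DOMDocument();
    // Collect parser warnings quietly instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $tables = $dom->getElementsByTagName('table');
    $rows = [];

    // Start at index 1 to skip the first table (assumed to be the table of contents).
    for ($i = 1; $i < $tables->length; $i++) {
        // Only rows without th children, i.e. skip the header rows.
        foreach ($xpath->query('.//tr[not(th)]', $tables->item($i)) as $tr) {
            $cells = $xpath->query('td', $tr);
            $row = [];
            foreach ($columns as $name => $position) {
                $cell = $cells->item($position);
                // A missing cell becomes an empty string.
                $row[$name] = $cell === null ? '' : trim($cell->textContent);
            }
            $rows[] = $row;
        }
    }
    return $rows;
}

// Made-up sample markup: a contents table followed by two data tables.
$html = <<<HTML
<table><tr><td>Contents</td></tr></table>
<p>Some text</p>
<table>
<tr><th>Code</th><th>Currency</th><th>Value</th></tr>
<tr><td>USD</td><td>US Dollar</td><td>1.61</td></tr>
</table>
<p>More text</p>
<table>
<tr><th>Code</th><th>Currency</th><th>Value</th></tr>
<tr><td>EUR</td><td>Euro</td><td>1.18</td></tr>
</table>
HTML;

// Positions 1 and 2 are the 2nd and 3rd td, as in the phpQuery version.
$rows = collectTableRows($html, ['currency' => 1, 'value' => 2]);
foreach ($rows as $row) {
    echo $row['currency'], ': ', $row['value'], PHP_EOL;
}
```

This collects the rows into a plain PHP array rather than moving nodes around in the document, which is usually enough if all you want is to loop over the values afterwards. For a file on disk, file_get_contents() could supply the $html string.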

Hope this helps someone out!

batfastad