
I need to extract large amounts of data from a variety of HTML files, and I will have to write a separate script for each type of HTML file in order to parse out the data I need correctly.

The data will be located in different parts of the document. For example, in document type one the data I need may sit neatly inside a DIV with an ID, but in document type two the only way to locate it may be to find a certain pattern of tags that contains it (like <div><b>DATA</b></div>).

From the little I've been able to find so far, it seems that DOMXPath may be able to help with at least some of the extraction. What other functions can I use, specifically for the second example: locating an arbitrary pattern of tags and getting their content?

MarathonStudios

2 Answers


If you are extracting different types of data from a variety of HTML files, you will quickly tire of using the DOMDocument API and XPath directly. Use one of the wrapper libraries listed in "How do you parse and process HTML/XML in PHP?". They provide a richer API and additional selectors.

I prefer phpQuery and QueryPath, which allow for:

print qp($url)->find("body p.article a")->attr("href");

print qp($html)->find("div b")->text();

The available methods are documented at http://api.querypath.org/docs/class_query_path.html; the API is mostly like jQuery.
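For comparison, here is a rough sketch of both cases from the question using only PHP's bundled DOMDocument and DOMXPath classes, with no wrapper library. The helper name extractText(), the sample markup, and the "price" ID are made up for illustration; the XPath expressions would need adapting to each document type.

```php
<?php
// Minimal sketch using only PHP's bundled DOM extension.
// extractText() is a hypothetical helper name, not part of any library.
function extractText(string $html, string $xpathQuery): array
{
    $doc = new DOMDocument();

    // Real-world HTML is often invalid; buffer parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($xpathQuery);
    if ($nodes === false) {
        // query() returns false when the expression is malformed.
        return [];
    }

    $results = [];
    foreach ($nodes as $node) {
        $results[] = trim($node->textContent);
    }
    return $results;
}

// Sample markup invented for illustration.
$html = '<html><body>'
      . '<div id="price">42</div>'
      . '<div><b>DATA</b></div>'
      . '</body></html>';

// Document type one: the data sits in a DIV with a known ID.
$byId = extractText($html, '//div[@id="price"]');

// Document type two: match a tag pattern, here a <b> that is a direct child of a <div>.
$byPattern = extractText($html, '//div/b');

print_r($byId);
print_r($byPattern);
```

Keeping one XPath expression per document type in a lookup array may be enough to avoid writing a completely separate script for each type, though messier layouts will probably still need custom handling.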

mario

If you plan on parsing many HTML files and you need to select or modify many elements of your HTML files, consider using a library.

I would recommend the library PHPPowertools/DOM-Query, which I wrote myself. It allows you to (1) load an HTML file and then (2) select or change parts of your HTML in much the same way you would with jQuery in a frontend app.

Example use:

// Select the body tag ($H is the DOM-Query object holding the loaded HTML)
$body = $H->select('body');

// Combine different classes as one selector to get all site blocks
$siteblocks = $body->select('.site-header, .masthead, .site-body, .site-footer');

// Chain your methods just like you would with jQuery
$siteblocks->select('button')->add('span')->addClass('icon icon-printer');

// Use a lambda function to set the text of all site blocks
$siteblocks->text(function($i, $val) {
    return $i . " - " . $val->attr('class');
});

// Append the following HTML to all site blocks
$siteblocks->append('<div class="site-center"></div>');

// Use a child selector to select the site's footer
$sitefooter = $body->select('.site-footer > .site-center');

// Set some attributes for the site's footer
$sitefooter->attr(array('id' => 'aweeesome', 'data-val' => 'see'));

// Use a lambda function to set the attributes of all site blocks
$siteblocks->attr('data-val', function($i, $val) {
    return $i . " - " . $val->attr('class') . " - photo by Kelly Clark";
});

// Select the parent of the site's footer
$sitefooterparent = $sitefooter->parent();

// Remove the class of all i-tags within the site's footer's parent
$sitefooterparent->select('i')->removeAttr('class');

// Wrap the site's footer within two new elements
$sitefooter->wrap('<section><div class="footer-wrapper"></div></section>');

[...]
John Slegers