I'm using PHP and I need to scrape some information from cURL responses from a site. I am simulating both an AJAX request by a browser and a normal (entire-page) request by a browser; however, the AJAX response differs slightly from the full-page response in this section of the HTML.

The AJAX response is: `<div id="accountProfile"><h2>THIS IS THE BIT I WANT</h2><dl id="accountProfileData">`

However, the normal response is: `<div id="accountProfile"><html xmlns="http://www.w3.org/1999/xhtml"><h2>THIS IS THE BIT I WANT</h2><dl id="accountProfileData">`

I.e. the AJAX response is missing the tag `<html xmlns="http://www.w3.org/1999/xhtml">`. I need to get the text between the h2 tags. Obviously I can't just scrape the page for `<h2>THIS IS THE BIT I WANT</h2><dl id="accountProfileData">`, since these tags may occur in other places and not contain the information I want.

I can match either one of the patterns individually, but I would like to handle both in a single regex. Here is my solution for matching the AJAX response:

<?php
$pattern = '/\<div id="accountProfile"\>\<h2\>(.+?)\<\/h2\>\<dl id="accountProfileData"\>/';
preg_match($pattern, $haystack, $matches);
print_r($matches);
?>

Can someone show me how I should alter the pattern to optionally match the `<html xmlns="http://www.w3.org/1999/xhtml">` tag as well? If it helps to simplify the haystack for the sake of brevity, that's fine.

mulllhausen
  • The normal response is broken - a `html` element has no place inside the document. I'm not entirely sure what your question is? Have you considered using a DOM parser to parse the HTML? See [Best methods to parse HTML](http://stackoverflow.com/questions/3577641/3577662#3577662) – Pekka May 10 '11 at 07:38
  • Broken it may be, but it's there all the same. I didn't write the site I am scraping. OK, I updated the requirements a bit – mulllhausen May 10 '11 at 07:39

1 Answer

I haven't tested it, but you can try this:

    $pattern = '/\<div id="accountProfile"\>(\<html xmlns=\"http:\/\/www\.w3\.org\/1999\/xhtml\"\>){0,1}\<h2\>(.+?)\<\/h2\>\<dl id="accountProfileData"\>/';
Dev.Jaap
  • That works - as long as you escape everything in `xmlns=\"http://www.w3.org/1999/xhtml` :) Also, you can simplify `{0,1}` to `?` – mulllhausen May 10 '11 at 07:58
  • I wonder if it's possible to write the pattern without brackets around the `html xmlns=...` tag? It's no big deal, but PHP's preg_match creates a new array element for anything matching a pattern in brackets. Of course I can just use the final `$matches` array element, but I'm curious whether it's possible to avoid capturing this unwanted `html xmlns=...` tag pattern. – mulllhausen May 12 '11 at 04:08
  • @mulllhausen: You can use a non-capturing group by adding `?:` to the start of it, so: `(?:\ – porges May 12 '11 at 04:22
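
Putting the suggestions from this thread together, the pattern could look roughly like the sketch below. It is only an illustration: the function name `extractProfileHeading` and the choice of `~` as the delimiter are assumptions, not something from the question or the answer. The optional tag sits in a non-capturing group `(?:...)?`, so the heading is the only captured group.

```php
<?php
// Hypothetical helper (the name is illustrative, not from the thread):
// returns the text inside the <h2>, or null if the pattern does not match.
function extractProfileHeading(string $haystack): ?string
{
    // '~' as the delimiter means the slashes in the URL and in </h2>
    // do not need escaping. (?:...)? is an optional non-capturing group,
    // so the heading text always ends up in $matches[1].
    $pattern = '~<div id="accountProfile">'
             . '(?:<html xmlns="http://www\.w3\.org/1999/xhtml">)?'
             . '<h2>(.+?)</h2><dl id="accountProfileData">~';

    return preg_match($pattern, $haystack, $matches) === 1 ? $matches[1] : null;
}

// The two simplified haystacks from the question.
$ajax   = '<div id="accountProfile"><h2>THIS IS THE BIT I WANT</h2><dl id="accountProfileData">';
$normal = '<div id="accountProfile"><html xmlns="http://www.w3.org/1999/xhtml"><h2>THIS IS THE BIT I WANT</h2><dl id="accountProfileData">';

var_dump(extractProfileHeading($ajax));
var_dump(extractProfileHeading($normal));
```

Both calls should return the same heading text, and `$matches` contains only the full match plus one captured group. If the site's markup varies more than this, a DOM parser (as suggested in the first comment) is likely to be more robust than a regex.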