I am fetching the HTML of my website with cURL:

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, "http://example.com/product.html");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it

$content = curl_exec($ch); // HTML string on success, false on failure

Now I want to find a specific <span> that contains an <a> tag whose href carries a query parameter. Is it possible to extract the value of this parameter ([eventUid]=22) with preg_match? The 22 is an ID that comes from a database, and I want to store it in a PHP variable.

Example:

<span><a title="mytitle" href="http://example.com/products.html?tx_example_pi1[eventUid]=22">example</a></span>

if (preg_match('@((https?://)?([-\w]+\.[-\w\.]+)+\w(:\d+)?(/([-\w/_\.]*(\?\S+)?)?)*)@', $content, $matches)) {
    echo $matches[2];
} else {
    echo 'Nothing found!';
}

So far, this pattern only finds links in general; it does not give me the parameter value.

Álvaro González
Jim
  • Just a suggestion: why not use parse_str()? It's much faster. – Dinesh Apr 09 '13 at 08:01
  • Doing that with regular expressions looks terribly complicated. I'd suggest simplifying and using [DOM functions](http://www.php.net/manual/en/book.dom.php) and [parse_url()](http://php.net/parse_url) instead. – Álvaro González Apr 09 '13 at 08:03
  • If you found the link, why don't you simply split the string on '=' and take the ID (22)? – Raheel Hasan Apr 09 '13 at 08:03
  • I can't find the link I'm searching for. I will try parse_url(). – Jim Apr 09 '13 at 08:09
  • possible duplicate of [How to parse and process HTML/XML?](http://stackoverflow.com/questions/3577641/how-to-parse-and-process-html-xml) – hjpotter92 Apr 09 '13 at 08:13
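
The parse_url() and parse_str() suggestions from the comments above can be sketched as follows. This is a minimal illustration, assuming the href has already been extracted from the page; the `$href` value is the example link from the question, and the variable names `$query`, `$params`, and `$eventUid` are made up for this sketch.

```php
<?php
// Example href taken from the question.
$href = 'http://example.com/products.html?tx_example_pi1[eventUid]=22';

// parse_url() with PHP_URL_QUERY returns only the query-string part of the URL.
$query = parse_url($href, PHP_URL_QUERY);

// parse_str() decodes the query string into an array; bracket syntax
// such as tx_example_pi1[eventUid] becomes a nested array.
$params = [];
parse_str((string) $query, $params);

// Read the nested value and cast it to an integer, or use null if it is absent.
$eventUid = isset($params['tx_example_pi1']['eventUid'])
    ? (int) $params['tx_example_pi1']['eventUid']
    : null;

var_dump($eventUid); // int(22)
```

Compared with splitting on '=', this approach keeps working if more parameters are appended to the URL later.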

1 Answer

Using regular expressions to search through HTML is error-prone; it's better to parse the document and select the elements with XPath:

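$doc = new DOMDocument;
$doc->loadHTML($content); // $content is the string returned by curl_exec()
$xp = new DOMXPath($doc);

// Select every <a> directly inside a <span> whose href contains "[eventUid]="
foreach ($xp->query('//span/a[contains(@href, "[eventUid]=")]') as $anchor) {
    // Capture the digits that follow the parameter name
    if (preg_match('/\[eventUid\]=(\d+)/', $anchor->getAttribute('href'), $matches)) {
        echo $matches[1];
    }
}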
Ja͢ck
  • What do you mean by $content? The URL of my website? Also, foreach ($xp->query('//span/a[contains(@href, "[eventUid]=22")]') won't work because the number is dynamic. Wouldn't foreach ($xp->query('//span/a[contains(@href, "[eventUid]=")]') be better? – Jim Apr 09 '13 at 09:30
  • @Jim You already have `$content` coming from `curl_exec()`. I've updated the XPath and updated the code inside the loop. – Ja͢ck Apr 09 '13 at 09:31
  • Thanks, but it doesn't enter the foreach loop :( var_dump($xp) returns object(DOMXPath)#179 (0) { } – Jim Apr 09 '13 at 09:44
  • @Jim Well, it works [here](http://codepad.viper-7.com/IABtBw), which is the HTML you gave earlier. – Ja͢ck Apr 09 '13 at 09:46
  • Hmm, yes, your example works. It seems to be a problem with parsing the website: http://codepad.viper-7.com/E0rccz – Jim Apr 09 '13 at 11:13
  • `$content = curl_exec($ch);` gives string(14909). `$doc = new DOMDocument; var_dump($doc);` gives object(DOMDocument)#178 (0) { }. `$doc->loadHTML($content); var_dump($doc);` gives object(DOMDocument)#178 (0) { }. `$xp = new DOMXPath($doc); var_dump($xp);` gives object(DOMXPath)#179 (0) { }. – Jim Apr 09 '13 at 12:15
  • @Jim $doc->saveHTML() should show you what it managed to parse. – Ja͢ck Apr 09 '13 at 12:31
  • `$doc->saveHTML($content)` gives me string(14909) via var_dump(), and echo prints the whole website. – Jim Apr 09 '13 at 12:41
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/27858/discussion-between-jim-and-jack) – Jim Apr 09 '13 at 12:45
  • @Jim Well, the website doesn't contain any links with `[eventUid]`; see [here](http://codepad.viper-7.com/NL3mfm). – Ja͢ck Apr 09 '13 at 12:46
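
The debugging steps in the thread above can be condensed into a short diagnostic sketch. An empty-looking var_dump() of a DOMDocument or DOMXPath object does not, by itself, indicate that parsing failed, so counting the nodes the query returns is usually more informative. The `$content` string below is a hypothetical stand-in for the cURL response (it reuses the example link from the question), and the looser second query is an assumption about what might help narrow the problem down, not something proposed in the thread.

```php
<?php
// Hypothetical stand-in for the string returned by curl_exec().
$content = '<span><a title="mytitle" '
         . 'href="http://example.com/products.html?tx_example_pi1[eventUid]=22">'
         . 'example</a></span>';

// Collect libxml parse warnings instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument;
$loaded = $doc->loadHTML($content);   // true on success, false on failure
$errors = libxml_get_errors();        // warnings raised while parsing
libxml_clear_errors();

$xp = new DOMXPath($doc);

// The query from the answer: <a> directly inside <span>.
$nodes = $xp->query('//span/a[contains(@href, "[eventUid]=")]');

// A looser query: any <a> whose href mentions eventUid, wherever it sits.
$any = $xp->query('//a[contains(@href, "eventUid")]');

printf(
    "loaded: %s, parse warnings: %d, strict matches: %d, loose matches: %d\n",
    var_export($loaded, true),
    count($errors),
    $nodes->length,
    $any->length
);
```

If both counts are zero for the real page, the markup most likely does not contain such a link at all, which is the conclusion the last comment reaches.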