PHP regex dynamic strings

Question

This is the part of the HTML content I am scraping:

<div class="sms-separator"></div>
<div class="wallpaper-ads-right">
  <b>Wallpaper:</b> 
     Rayman Legends Game sms<br />
  <b>Categories: </b>
     <a href="/games-desktop-wallpapers.html" title="Games wallpapers"> Games</a>
  <br /><b>

What I need is the text in place of 'Games'. On page refresh the markup will look like this:

<div class="sms-separator"></div>
 <div class="wallpaper-ads-right">
    <b>Wallpaper:</b> 
      Souya ssss<br />
    <b>Categories: </b>
      <a href="/soutss-tourguides" title="Tour"> Tourist</a><br /><b>

From the content above I want to scrape "Tourist".

The problem is that the href and title attributes before it are dynamic and vary from page to page, so how can I account for that in the regular expression?

So in other words, you want the contents of the `<a>` tag that immediately follows `Categories: `? — Tim Pietzcker, Jun 25 '13 at 08:42
Why not use a real parser for this, like DOM? Parsing HTML with regexes is [not good for you](http://stackoverflow.com/a/1732454/1515540). — complex857, Jun 25 '13 at 08:42
Parsing HTML with regular expressions is [generally looked down upon](http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html). There are more robust solutions. Are you open to these? — Brigand, Jun 25 '13 at 08:43
You might need this: http://stackoverflow.com/questions/11898998/regex-match-non-greedy — f.ardelian, Jun 25 '13 at 08:49
Without knowing how the values vary or what the specific target is (the content within the href/title attribute, or the attribute itself), it's impossible to answer this question. — symcbean, Jun 25 '13 at 08:51
The value of Categories is Games; I want the text Games to be my scraped string @TimPietzcker — CartMan, Jun 25 '13 at 09:37
I know which *text* you want in your examples, my question was "What is the *rule* the regex can follow to find it?". — Tim Pietzcker, Jun 25 '13 at 12:52
possible duplicate of [Parsing and processing HTML/XML?](http://stackoverflow.com/questions/3577641/parsing-and-processing-html-xml) — Justin, Jun 25 '13 at 13:49
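
The parser-based route suggested in the comments above can be sketched roughly as follows. This is a minimal sketch, not taken from any answer in this thread: it assumes PHP's dom extension is available, that the markup looks like the sample in the question, and that the category link is the first `<a>` sibling after the `<b>` whose text starts with "Categories". The variable names are illustrative.

```php
<?php
// Sample markup modelled on the question (closed so it parses cleanly).
$html = <<<'EOT'
<div class="wallpaper-ads-right">
  <b>Wallpaper:</b>
     Rayman Legends Game sms<br />
  <b>Categories: </b>
     <a href="/games-desktop-wallpapers.html" title="Games wallpapers"> Games</a>
  <br />
</div>
EOT;

// Suppress libxml warnings so imperfect real-world markup does not spam output.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// First <a> sibling after the <b> element whose text starts with "Categories".
$nodes = $xpath->query(
    '//b[starts-with(normalize-space(.), "Categories")]/following-sibling::a[1]'
);

// The href and title values never appear in the query, so they can vary freely.
$category = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
echo $category, "\n";
```

Because the query keys on the "Categories" label rather than on the attribute values, the dynamic href and title do not need to be described at all.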

score 0 · Answer 1 · answered Jun 25 '13 at 12:30

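<?php
// Read HTML from standard input line by line and print the text of each
// anchor that has an href attribute followed by a title attribute.
while (($line = fgets(STDIN)) !== false) {
    // [^"]* keeps each attribute match inside its own quotes, and (.*?)
    // stops at the first </a>, so several links on one line do not merge.
    if (preg_match('#<a href="[^"]*" title="[^"]*">(.*?)</a>#', $line, $match)) {
        echo trim($match[1]), "\n"; // trim drops the leading space before the text
    }
}
?>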

Armali


Ro Yo Mi · Accepted Answer · 2013-06-25T16:12:58.660

Description

This expression captures the title of the section and the href and title values of the link that follows it. I left it as a multiline expression to help with readability. The multiline form requires the x option (ignore whitespace in pattern).

<b>([\w\s]+):\s*<\/b>[\s\r\n]*?
<a\b(?=\s)
(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\shref=('[^']*'|"[^"]*"|[^'"][^\s>]*))
(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\stitle=('[^']*'|"[^"]*"|[^'"][^\s>]*))

Expanded

<b>([\w\s]+):\s*<\/b>[\s\r\n]*? finds the category header, captures the text before the : and skips the whitespace up to the anchor
<a\b(?=\s) matches the start of the opening anchor tag
(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\shref=('[^']*'|"[^"]*"|[^'"][^\s>]*)) collects the href value; the additional alternation guards against odd edge cases and allows the attributes to appear in any order inside the tag
(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\stitle=('[^']*'|"[^"]*"|[^'"][^\s>]*)) collects the title value, using the same construct as the href match above
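
To illustrate the "any order" point, here is a small sketch that runs only the href lookahead against two hypothetical opening tags whose attributes are swapped. The sample tags and variable names are made up for the demonstration. The lookahead itself is copied from the expression above.

```php
<?php
// Hypothetical sample tags: the same two attributes in different orders.
$tags = [
    '<a href="/soutss-tourguides" title="Tour">',
    '<a title="Tour" href="/soutss-tourguides">',
];

// Only the anchor start and the href lookahead; x lets the pattern span lines.
$pattern = '/<a\b(?=\s)
    (?=(?:[^>=]|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*?\shref=(\'[^\']*\'|"[^"]*"|[^\'"][^\s>]*))
    /ix';

$hrefs = [];
foreach ($tags as $tag) {
    if (preg_match($pattern, $tag, $m)) {
        // Group 1 still includes the surrounding quotes, so strip them.
        $hrefs[] = trim($m[1], '\'"');
    }
}
print_r($hrefs);
```

Because the lookahead scans forward from the position right after `<a` without consuming anything, the title lookahead can start from the same position, which is what makes the attribute order irrelevant.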

PHP Code Example:

Input Text

<div class="sms-separator"></div>
<div class="wallpaper-ads-right">
  <b>Wallpaper:</b> 
     Rayman Legends Game sms<br />
  <b>Categories: </b>
     <a href="/games-desktop-wallpapers.html" title="Games wallpapers"> Games</a>
  <br /><b>
<div class="sms-separator"></div>
 <div class="wallpaper-ads-right">
    <b>Wallpaper:</b> 
      Souya ssss<br />
    <b>Categories: </b>
      <a href="/soutss-tourguides" title="Tour"> Tourist</a><br /><b>

Code

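<?php
// Replace this placeholder with the HTML you fetched.
$sourcestring = "your source string";

// Flags: i = case-insensitive, m = multiline anchors, s = dot matches newlines,
// x = ignore whitespace in the pattern, which permits the multiline layout.
// Group 1 is the header text, group 2 the href value, group 3 the title value.
preg_match_all('/<b>([\w\s]+):\s*<\/b>[\s\r\n]*?
<a\b(?=\s)
(?=(?:[^>=]|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*?\shref=(\'[^\']*\'|"[^"]*"|[^\'"][^\s>]*))
(?=(?:[^>=]|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*?\stitle=(\'[^\']*\'|"[^"]*"|[^\'"][^\s>]*))/imsx', $sourcestring, $matches);
echo "<pre>" . print_r($matches, true) . "</pre>";
?>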

Matches

$matches Array:
(
    [0] => Array
        (
            [0] => <b>Categories: </b>
     <a
            [1] => <b>Categories: </b>
      <a
        )

    [1] => Array
        (
            [0] => Categories
            [1] => Categories
        )

    [2] => Array
        (
            [0] => "/games-desktop-wallpapers.html"
            [1] => "/soutss-tourguides"
        )

    [3] => Array
        (
            [0] => "Games wallpapers"
            [1] => "Tour"
        )

)
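
The matches above stop at the opening anchor tag, so the link text the question asks for (Games, Tourist) is not among the captured groups. Below is a minimal sketch of one way to extend the expression, assuming no attribute value contains a literal > character. The trailing [^>]*>\s*(.*?)\s*<\/a> portion, the fourth group, and the variable names are additions for illustration, not part of the original answer.

```php
<?php
// Sample input modelled on the two snippets from the question.
$html = <<<'EOT'
<b>Wallpaper:</b>
   Rayman Legends Game sms<br />
<b>Categories: </b>
   <a href="/games-desktop-wallpapers.html" title="Games wallpapers"> Games</a>
<br /><b>
<b>Wallpaper:</b>
   Souya ssss<br />
<b>Categories: </b>
   <a href="/soutss-tourguides" title="Tour"> Tourist</a><br /><b>
EOT;

// Same header and lookaheads as above; the last line consumes the rest of the
// opening tag and captures the link text (group 4) without surrounding spaces.
$pattern = '/<b>([\w\s]+):\s*<\/b>\s*
    <a\b(?=\s)
    (?=(?:[^>=]|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*?\shref=(\'[^\']*\'|"[^"]*"|[^\'"][^\s>]*))
    (?=(?:[^>=]|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*?\stitle=(\'[^\']*\'|"[^"]*"|[^\'"][^\s>]*))
    [^>]*>\s*(.*?)\s*<\/a>/isx';

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

// Collect group 4 from every match.
$categories = array_column($matches, 4);
print_r($categories);
```

With the sample input this yields Games and Tourist. The href and title values remain available in groups 2 and 3 if they are needed.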
