-2

This is the part of the HTML content I am scraping:

<div class="sms-separator"></div>
<div class="wallpaper-ads-right">
  <b>Wallpaper:</b> 
     Rayman Legends Game sms<br />
  <b>Categories: </b>
     <a href="/games-desktop-wallpapers.html" title="Games wallpapers"> Games</a>
  <br /><b>

I need to get the link text, 'Games' in this example. After a page refresh the markup looks like this:

<div class="sms-separator"></div>
 <div class="wallpaper-ads-right">
    <b>Wallpaper:</b> 
      Souya ssss<br />
    <b>Categories: </b>
      <a href="/soutss-tourguides" title="Tour"> Tourist</a><br /><b>

From the content above I want to scrape "Tourist".

The problem is that the href and title attributes of the preceding <a> tag are dynamic and vary from page to page. How can I account for that in the regular expression?

Imane Fateh
CartMan
  • So in other words, you want the contents of the `href` tag that immediately follows `Categories: `? – Tim Pietzcker Jun 25 '13 at 08:42
  • Why not use a real parser such as DOM for this? Parsing HTML with regexes is [not good for you](http://stackoverflow.com/a/1732454/1515540). – complex857 Jun 25 '13 at 08:42
  • Parsing HTML with regular expressions is [generally looked down upon](http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html). There are more robust solutions. Are you open to these? – Brigand Jun 25 '13 at 08:43
  • You might need this: http://stackoverflow.com/questions/11898998/regex-match-non-greedy – f.ardelian Jun 25 '13 at 08:49
  • Without knowing how the values vary or what the specific target is (the content within the href/title attribute, or the attribute itself), it's impossible to answer this question. – symcbean Jun 25 '13 at 08:51
  • The value of Categories is Games; I want the text Games to be my scraped string @TimPietzcker – CartMan Jun 25 '13 at 09:37
  • I know which *text* you want in your examples, my question was "What is the *rule* the regex can follow to find it?". – Tim Pietzcker Jun 25 '13 at 12:52
  • possible duplicate of [Parsing and processing HTML/XML?](http://stackoverflow.com/questions/3577641/parsing-and-processing-html-xml) – Justin Jun 25 '13 at 13:49
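
The DOM approach suggested in the comments could look roughly like the sketch below. It is not taken from either answer. It assumes PHP's bundled DOM extension is available and that the label text is always exactly "Categories:"; the function name categoryFromHtml is made up for illustration.

```php
<?php
// Hypothetical helper: returns the text of the first link after "Categories:", or null.
function categoryFromHtml(string $html): ?string
{
    $doc = new DOMDocument();
    // Scraped pages are rarely valid HTML; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // First <a> sibling that follows the <b> whose trimmed text is "Categories:".
    $links = $xpath->query('//b[normalize-space(.)="Categories:"]/following-sibling::a[1]');
    if ($links === false || $links->length === 0) {
        return null;
    }
    return trim($links->item(0)->textContent);
}

$html = <<<'HTML'
<div class="sms-separator"></div>
<div class="wallpaper-ads-right">
  <b>Wallpaper:</b>
     Souya ssss<br />
  <b>Categories: </b>
     <a href="/soutss-tourguides" title="Tour"> Tourist</a><br />
</div>
HTML;

$category = categoryFromHtml($html);
echo $category, "\n";
```

Because the query keys on the label rather than on the link's attributes, it does not matter that href and title change from page to page.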

2 Answers

0
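<?php
// Read the HTML line by line from standard input.
while (($line = fgets(STDIN)) !== false) {
    // [^"]* keeps each attribute match inside its own quotes, whatever the value is.
    // Group 1 is the link text without surrounding whitespace.
    if (preg_match('~<a href="[^"]*" title="[^"]*">\s*([^<]*?)\s*</a>~', $line, $match)) {
        echo $match[1], "\n";
    }
}
?>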
Armali
0

Description

This expression captures the section title (the text before the colon) and the href and title attribute values of the link that follows it. It does not capture the link text itself. I left it as a multiline expression for readability; the multiline form requires the x (ignore whitespace in pattern) option.

<b>[\w\s]+:\s*<\/b>.*?
<a\b(?=\s)
(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\shref=('[^']*'|"[^"]*"|[^'"][^\s>]*))
(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\stitle=('[^']*'|"[^"]*"|[^'"][^\s>]*))


Expanded

  • <b>[\w\s]+:\s*<\/b>.*? finds the category header; the PHP example below wraps [\w\s]+ in parentheses to capture the text before the :
  • <a\b(?=\s) matches the start of the opening anchor tag and requires whitespace right after the tag name
  • (?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\shref=('[^']*'|"[^"]*"|[^'"][^\s>]*)) collects the href value, including its quotes. The extra alternations skip over other attributes (single-quoted, double-quoted, or unquoted), which guards against odd edge cases and lets the attributes appear in any order inside the tag
  • (?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\stitle=('[^']*'|"[^"]*"|[^'"][^\s>]*)) collects the title value in the same way as the href match above
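
To see the "any order" behaviour in isolation, here is a small sketch that applies only the href lookahead from the list above to a few made-up tags. The trailing [^>]*> and the sample tags are my own additions for the demo, not part of the expression above.

```php
<?php
// The href lookahead from the list above, followed by [^>]*> to consume the rest of the tag.
// The x flag lets the pattern span several lines; i makes it case-insensitive.
$pattern = '/<a\b(?=\s)
(?=(?:[^>=]|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*?\shref=(\'[^\']*\'|"[^"]*"|[^\'"][^\s>]*))
[^>]*>/ix';

// Illustrative tags: href first, href last with single quotes, and unquoted values.
$tags = [
    '<a href="/one" title="First">',
    '<a title="Second" href=\'/two\'>',
    '<a class=x href=/three>',
];

$hrefs = [];
foreach ($tags as $tag) {
    if (preg_match($pattern, $tag, $m)) {
        // The capture keeps the surrounding quotes, so strip them here.
        $hrefs[] = trim($m[1], '\'"');
    }
}
print_r($hrefs);
```

Since the attribute is found inside a lookahead, the match position stays right after <a, which is what allows a second lookahead for title to scan the same tag independently.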

PHP Code Example:

Input Text

<div class="sms-separator"></div>
<div class="wallpaper-ads-right">
  <b>Wallpaper:</b> 
     Rayman Legends Game sms<br />
  <b>Categories: </b>
     <a href="/games-desktop-wallpapers.html" title="Games wallpapers"> Games</a>
  <br /><b>
<div class="sms-separator"></div>
 <div class="wallpaper-ads-right">
    <b>Wallpaper:</b> 
      Souya ssss<br />
    <b>Categories: </b>
      <a href="/soutss-tourguides" title="Tour"> Tourist</a><br /><b>

Code

<?php
$sourcestring="your source string";
preg_match_all('/<b>([\w\s]+):\s*<\/b>[\s\r\n]*?
<a\b(?=\s)
(?=(?:[^>=]|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*?\shref=(\'[^\']*\'|"[^"]*"|[^\'"][^\s>]*))
(?=(?:[^>=]|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*?\stitle=(\'[^\']*\'|"[^"]*"|[^\'"][^\s>]*))/imsx',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>

Matches

$matches Array:
(
    [0] => Array
        (
            [0] => <b>Categories: </b>
     <a
            [1] => <b>Categories: </b>
      <a
        )

    [1] => Array
        (
            [0] => Categories
            [1] => Categories
        )

    [2] => Array
        (
            [0] => "/games-desktop-wallpapers.html"
            [1] => "/soutss-tourguides"
        )

    [3] => Array
        (
            [0] => "Games wallpapers"
            [1] => "Tour"
        )

)
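
The matches above contain the href and title values but not the link text the question asks for. A minimal sketch of one way to get that text, assuming the label is always "Categories:" and that no attribute value contains a > character; this is a simplified pattern of my own, not the expression above:

```php
<?php
$source = <<<'HTML'
<b>Categories: </b>
     <a href="/games-desktop-wallpapers.html" title="Games wallpapers"> Games</a>
  <br /><b>
<b>Categories: </b>
      <a href="/soutss-tourguides" title="Tour"> Tourist</a><br /><b>
HTML;

// Literal "Categories:" header, then the next <a ...> tag with any attributes,
// then the link text (group 1) with surrounding whitespace left out.
$pattern = '~<b>\s*Categories:\s*</b>\s*<a\b[^>]*>\s*([^<]*?)\s*</a>~i';

preg_match_all($pattern, $source, $matches);
$categories = $matches[1];
print_r($categories);
```

Using [^>]* for the attributes sidesteps the dynamic href and title entirely, at the cost of the edge-case handling that the longer expression provides.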
Ro Yo Mi