I'm trying to strip out a string, which occurs only once on a page obtained using cURL. Example:

<h3 class=" ">STRING IN QUESTION</h3>

or

<h3 class="active">STRING IN QUESTION</h3>

or

<h3 class=" active">STRING IN QUESTION</h3>

I would like to do this using preg_match, unless it can be accomplished with a less resource-intensive method.

Here is the regex I'm using, which is producing zero results:

<h3\sclass="\s">(.*?)</h3>

EDIT:

Here is the actual code (with an actual URL used here in place of the dynamic one). I discovered that when the page is pulled via cURL, the class attribute does not exist, but the match still does not work as shown:

$ch = curl_init ("URL IN QUESTION"); 
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$page = curl_exec($ch);

preg_match('<h3>(.*?)</h3>', $page, $match);

print_r($match);

Prints Nothing

FurryWombat

3 Answers

3

This does the trick:

$str='<h3 class=" ">STRING IN QUESTION</h3>';
preg_match('/<h3.*?>(.*?)<\/h3>/',$str,$match);
print_r($match);

Output:

Array
(
    [0] => <h3 class=" ">STRING IN QUESTION</h3>
    [1] => STRING IN QUESTION
)

Explanation:

<h3.*?> # Match h3 tags (non-greedy)
(.*?)   # Match everything after tag (non-greedy, captured)     
<\/h3>  # Match closing tag - Note the escaped forward slash!
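
As a quick check of the pattern explained above, here is a minimal sketch that runs it against the three variants from the question, plus a bare <h3> as reported for the cURL response. The helper name h3_text is hypothetical and only wraps preg_match for illustration.

```php
<?php
// Hypothetical helper: returns the text between the first <h3 ...> and </h3>,
// or null when the pattern does not match.
function h3_text(string $html): ?string
{
    if (preg_match('/<h3.*?>(.*?)<\/h3>/', $html, $match) === 1) {
        return $match[1]; // first capture group: the text between the tags
    }
    return null;
}

$variants = [
    '<h3 class=" ">STRING IN QUESTION</h3>',
    '<h3 class="active">STRING IN QUESTION</h3>',
    '<h3 class=" active">STRING IN QUESTION</h3>',
    '<h3>STRING IN QUESTION</h3>', // no class attribute at all
];

foreach ($variants as $html) {
    echo h3_text($html), "\n";
}
```

Because <h3.*?> accepts any run of characters up to the first >, the class attribute can be present, empty, or missing.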

However, that URL contains no <h3> tags. It does contain an <h1> tag, and to match it you need to add the trailing s modifier so that . also matches newlines:

preg_match('/<h1.*?>(.*?)<\/h1>/s',$page,$match);

Output:

Array
(
    [0] => <h1 class="">
<span class="pageTitle ">Braman Motorcars</span>
</h1>
    [1] => 
<span class="pageTitle ">Braman Motorcars</span>

)
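
To illustrate the effect of the s modifier, here is a small sketch using a hard-coded copy of the <h1> markup from the output above (the $page value is an assumption standing in for the real cURL response). It also shows one possible way to reduce the captured group to plain text with strip_tags and trim.

```php
<?php
// Stand-in for the cURL response; the heading spans several lines.
$page = "<h1 class=\"\">\n<span class=\"pageTitle \">Braman Motorcars</span>\n</h1>";

// Without the s modifier, "." stops at newlines, so nothing matches.
$withoutS = preg_match('/<h1.*?>(.*?)<\/h1>/', $page, $m1);

// With the s modifier, "." also matches newlines.
$withS = preg_match('/<h1.*?>(.*?)<\/h1>/s', $page, $m2);

// Drop the inner <span> markup and the surrounding whitespace.
$title = trim(strip_tags($m2[1]));

var_dump($withoutS, $withS, $title);
```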
Chris Seymour
  • Thank you for your answer and for the explanation. It works with the dummy code but not with the cURL. Still producing an empty array. – FurryWombat Nov 25 '12 at 20:46
  • I copied your code and did a `print $page`; it doesn't contain any `<h3>` tags? – Chris Seymour Nov 25 '12 at 20:52
  • Thank you for pointing out that the URL was incorrect. Any one of the below answers works, given the correct URL. Yours having the explanation for idiots like myself, makes me happy. Thank you. – FurryWombat Nov 25 '12 at 21:00
  • Here's an interesting question... given the proper code, is there any way to strip out the stuff between H3 tags WITHOUT the use of a regex? In this case, there is only ONE instance of an H3 tag on the page. – FurryWombat Nov 25 '12 at 21:04
  • Explanations should be given for `noobs` and `regexperts` alike. Anyway, see my update on multi-line matching. You really don't want to use `regex` to parse XML/HTML. Look at using an XML parser. – Chris Seymour Nov 25 '12 at 21:08
  • Agreed. Thank you again for your answer, and for your detailed explanation. In searching for a good HTML parser, I came across [this](http://simplehtmldom.sourceforge.net/). Anything better out there that you might recommend? – FurryWombat Nov 25 '12 at 22:04
  • No problem. See this question: http://stackoverflow.com/questions/3577641/how-to-parse-and-process-html-with-php. It should be very helpful. – Chris Seymour Nov 25 '12 at 22:07
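
For the question raised in the comments about getting the H3 text without a regex, here is a minimal sketch using PHP's bundled DOM extension. It assumes the extension is enabled (it usually is by default), and the helper name first_h3_text is hypothetical. This is one possible approach, not necessarily what the linked question recommends.

```php
<?php
// Hypothetical helper: returns the trimmed text of the first <h3>, or null.
function first_h3_text(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }

    // Real-world HTML is rarely valid; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null;
    }

    $node = $doc->getElementsByTagName('h3')->item(0);
    return $node === null ? null : trim($node->textContent);
}

echo first_h3_text('<p>intro</p><h3 class=" active">STRING IN QUESTION</h3>'), "\n";
```

Since textContent ignores nested markup, this also copes with headings that wrap their text in a <span> or spread it over several lines.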
1

Maybe:

<h3\s+class="\s*(active)?">(.*?)</h3>

and then use group 1 ($match[1] in PHP) to retrieve "active" or "" and group 2 ($match[2]) for "String in question"

I've never written any PHP, but maybe this would work:

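$result = "not found"; // fallback when the pattern does not match
if (preg_match('#<h3\s+class="\s*(active)?">(.*?)</h3>#', $page, $match))
{
    // $match[1] is "active" or "", $match[2] is the heading text
    $result = $match;
}
print_r($result);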
J-Mik
0

Try with:

preg_match('#<h3\s?class="\s?(active)?">(.+)</h3>#', $yourString, $match);

Remember that in PHP a regex passed to preg_match must always be wrapped in delimiters (here, #).
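
To show why the delimiter matters, here is a short sketch comparing the undelimited pattern from the question with a delimited one. The $page value is a stand-in for the real cURL response, and the variable names are illustrative only.

```php
<?php
$page = '<h3>STRING IN QUESTION</h3>';

// PHP accepts <...> as a bracket-style delimiter pair, so here the pattern is
// just "h3" and everything after the first ">" is read as modifiers, which
// makes the call fail. The @ only hides the resulting warning.
$broken = @preg_match('<h3>(.*?)</h3>', $page, $m1);

// With explicit # delimiters the whole expression is the pattern.
$fixed = preg_match('#<h3>(.*?)</h3>#', $page, $m2);

var_dump($broken); // false means a pattern error, which is different from 0 (no match)
var_dump($fixed, $m2[1]);
```

Checking the return value with === false is what distinguishes a broken pattern from a pattern that simply did not match.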

jacoz