Regex to strip string inside specific HTML tag

Question

I'm trying to strip out a string, which occurs only once on a page obtained using cURL. Example:

<h3 class=" ">STRING IN QUESTION</h3>

or

<h3 class="active">STRING IN QUESTION</h3>

or

<h3 class=" active">STRING IN QUESTION</h3>

I would like to do this using preg_match, unless it can be accomplished with a less resource-intensive method.

Here is the regex I'm using, which is producing zero results:

<h3\sclass="\s">(.*?)</h3>

EDIT:

Here is the actual code (an actual URL used here in place of dynamic one) -- discovered that when pulled via cURL, the class attribute does not exist, but still does not work as shown:

$ch = curl_init ("URL IN QUESTION"); 
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$page = curl_exec($ch);

preg_match('<h3>(.*?)</h3>', $page, $match);

print_r($match);

Prints Nothing

That is correct. Wanting to use this in a share widget to set title for a facebook share, from a mobile site (do not have control over the site in question), the titles of which are static, and do not accurately describe the content. — FurryWombat, Nov 25 '12 at 20:26
You must provide a [delimiter](http://php.net/manual/en/regexp.reference.delimiters.php), however, take a look at my answer. — jacoz, Nov 25 '12 at 20:38
Check if `$page` isn't empty. According to php manual, it'll return `FALSE` on failure. — jacoz, Nov 25 '12 at 20:47
Page wasn't empty, but was pulling from an incorrect URL. Given the proper URL, it works. Thank you. — FurryWombat, Nov 25 '12 at 21:05
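
The comments above suggest checking what `curl_exec` actually returned before blaming the pattern. A minimal sketch of that check follows. The helper names `fetchPage` and `firstH3` are hypothetical, the URL placeholder is the one from the question, and the pattern is the accepted answer's with `#` delimiters and the `s` modifier.

```php
<?php
// Hypothetical helper: fetch a page and fail loudly instead of passing FALSE on.
function fetchPage(string $url): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $page = curl_exec($ch);
    if ($page === false) {
        // curl_exec returns FALSE on failure; curl_error describes why.
        throw new RuntimeException('cURL failed: ' . curl_error($ch));
    }
    return $page;
}

// Hypothetical helper: text of the first <h3>, or null if there is none.
function firstH3(string $html): ?string
{
    $found = preg_match('#<h3.*?>(.*?)</h3>#s', $html, $match);
    if ($found === false) {
        // FALSE means the pattern itself failed, not that nothing matched.
        throw new RuntimeException('preg_match error code: ' . preg_last_error());
    }
    return $found === 1 ? trim($match[1]) : null;
}

// Offline demo on literal strings; in real use pass fetchPage("URL IN QUESTION").
var_dump(firstH3('<h3 class=" active">STRING IN QUESTION</h3>'));
var_dump(firstH3('<p>no heading here</p>'));
```

Separating "the fetch failed", "the pattern failed" and "nothing matched" makes it easier to see that the problem in this thread was the URL rather than the regex.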

Chris Seymour · Accepted Answer · 2012-11-25T21:04:27.847

3

This does the trick:

$str='<h3 class=" ">STRING IN QUESTION</h3>';
preg_match('/<h3.*?>(.*?)<\/h3>/',$str,$match);
print_r($match);

Output:

Array
(
    [0] => <h3 class=" ">STRING IN QUESTION</h3>
    [1] => STRING IN QUESTION
)

Explanation:

<h3.*?> # Match h3 tags (non-greedy)
(.*?)   # Match everything after tag (non-greedy, captured)     
<\/h3>  # Match closing tag - Note the escaped forward slash!

However, that URL contains no <h3> tags. It does contain an <h1> tag, and to match it you need the trailing s modifier so that the dot also matches newlines:

preg_match('/<h1.*?>(.*?)<\/h1>/s',$page,$match);

Output:

Array
(
    [0] => <h1 class="">
<span class="pageTitle ">Braman Motorcars</span>
</h1>
    [1] => 
<span class="pageTitle ">Braman Motorcars</span>

)
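
To see why the `s` modifier matters here, the sketch below runs the same pattern with and without it on the markup shown in the output above. Cleaning the capture with `strip_tags` and `trim` is an addition for illustration, not part of the original answer.

```php
<?php
// Sample markup copied from the output above (the heading spans three lines).
$page = "<h1 class=\"\">\n<span class=\"pageTitle \">Braman Motorcars</span>\n</h1>";

// Without the s modifier, '.' stops at newlines, so nothing matches.
$withoutS = preg_match('/<h1.*?>(.*?)<\/h1>/', $page, $m1);

// With the s modifier, '.' also matches newlines.
$withS = preg_match('/<h1.*?>(.*?)<\/h1>/s', $page, $m2);

// The capture still holds the inner <span> and surrounding whitespace; remove both.
$title = $withS === 1 ? trim(strip_tags($m2[1])) : null;

var_dump($withoutS, $withS, $title);
```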

edited Nov 25 '12 at 21:04

answered Nov 25 '12 at 20:39


Thank you for your answer and for the explanation. It works with the dummy code but not with the cURL result, which still produces an empty array. – FurryWombat Nov 25 '12 at 20:46
I copied your code and did a `print $page`; it doesn't contain any `<h3>` tags? – Chris Seymour Nov 25 '12 at 20:52
Thank you for pointing out that the URL was incorrect. Any one of the below answers works, given the correct URL. Yours having the explanation for idiots like myself, makes me happy. Thank you. – FurryWombat Nov 25 '12 at 21:00
Here's an interesting question... given the proper code, is there any way to strip out the stuff between H3 tags WITHOUT the use of a regex? In this case, there is only ONE instance of an H3 tag on the page. – FurryWombat Nov 25 '12 at 21:04
Explanations should be given for `noobs` and `regexperts` alike. Anyway, see my update on multi-line matching. You really don't want to use `regex` to parse XML/HTML. Look at using an XML parser. – Chris Seymour Nov 25 '12 at 21:08
Agreed. Thank you again for your answer, and for your detailed explanation. In searching for a good HTML parser, I came across [this](http://simplehtmldom.sourceforge.net/). Anything better out there that you might recommend? – FurryWombat Nov 25 '12 at 22:04
No problem, See this question http://stackoverflow.com/questions/3577641/how-to-parse-and-process-html-with-php it should be very helpful. – Chris Seymour Nov 25 '12 at 22:07
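
As one possible answer to the "without a regex" question in the comments above, here is a minimal sketch using PHP's bundled DOM extension. The helper name `firstTagText` is hypothetical, and the sample markup is the one from the question, not the real page.

```php
<?php
// Hypothetical helper: text of the first element with the given tag name, or null.
function firstTagText(string $html, string $tag): ?string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $node = $doc->getElementsByTagName($tag)->item(0);
    // textContent drops nested tags such as an inner <span>.
    return $node === null ? null : trim($node->textContent);
}

$html = '<html><body><h3 class=" active">STRING IN QUESTION</h3></body></html>';
var_dump(firstTagText($html, 'h3'));
```

Because the parser works on elements and not on characters, attributes, line breaks, and nested tags inside the heading do not need special handling.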

J-Mik · Answer 2 · 2012-11-25T20:45:44.083

1

Maybe:

<h3\s+class="\s*(active)?">(.*?)</h3>

and then use the \1 to retrieve "active" or "" and \2 for "String in question"

I've never done any php, but maybe this would work?:

$result = "not found";
if (preg_match('#<h3\s+class="\s*(active)?">(.*?)</h3>#', $page, $match))
{
    // $match[1] is "active" or "", $match[2] is the heading text
    $result = $match;
}
print_r($result);
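
In PHP the two groups end up in `$match[1]` and `$match[2]`. The sketch below runs this answer's pattern against the three variants from the question; the `$samples` and `$results` names are only for illustration.

```php
<?php
$pattern = '#<h3\s+class="\s*(active)?">(.*?)</h3>#';

// The three variants listed in the question.
$samples = [
    '<h3 class=" ">STRING IN QUESTION</h3>',
    '<h3 class="active">STRING IN QUESTION</h3>',
    '<h3 class=" active">STRING IN QUESTION</h3>',
];

$results = [];
foreach ($samples as $sample) {
    if (preg_match($pattern, $sample, $match)) {
        // $match[1] is "active" or an empty string; $match[2] is the heading text.
        $results[] = [$match[1], $match[2]];
    }
}
print_r($results);
```

Note that this pattern requires a class attribute, so it would not match the bare `<h3>` that the asker reported seeing in the cURL response.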

edited Nov 25 '12 at 20:45

answered Nov 25 '12 at 20:25


An improvement, but still no results. – FurryWombat Nov 25 '12 at 20:29
Here it's working fine though, what is the regex library/application you're using? – J-Mik Nov 25 '12 at 20:31
Going to post the full code. It's probably something screwed up outside the regex. – FurryWombat Nov 25 '12 at 20:33

score 0 · Answer 3 · answered Nov 25 '12 at 20:37

0

Try with:

preg_match('#<h3\s?class="\s?(active)?">(.+)</h3>#', $yourString, $match);

Remember, in your regex you must always provide a delimiter.
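
A short sketch of what the delimiter rule means in practice, using the sample string from the question. PHP also accepts bracket pairs such as `<` and `>` as delimiters, so the undelimited pattern from the question is most likely read as `h3` followed by invalid modifiers; the `@` operator here only hides the resulting warning.

```php
<?php
$html = '<h3 class="active">STRING IN QUESTION</h3>';

// No real delimiters: PHP treats '<' and '>' as the delimiter pair and
// rejects the rest as unknown modifiers, so preg_match returns false.
$noDelimiter = @preg_match('<h3>(.*?)</h3>', $html, $broken);

// '/' as delimiter: the slash in </h3> must be escaped.
preg_match('/<h3.*?>(.*?)<\/h3>/', $html, $slash);

// '#' as delimiter: the slash needs no escaping.
preg_match('#<h3.*?>(.*?)</h3>#', $html, $hash);

var_dump($noDelimiter, $slash[1], $hash[1]);
```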

jacoz


Tried this, and `#<h3>(.*?)</h3>#` -- now getting an array, but it is empty. – FurryWombat Nov 25 '12 at 20:41
It should work, I tested and it works! Your string is in `$match[2]`. – jacoz Nov 25 '12 at 20:42
It works with the dummy code, but not with the scraped code (see edited question, with full code posted). Produces an empty array for me. – FurryWombat Nov 25 '12 at 20:47
Obviously `$yourString` must be replaced with `$page`. And check if `$page` is `FALSE` because according to php manual `curl_exec` returns false on failure. – jacoz Nov 25 '12 at 20:52
Variable names were changed. Getting a result now, but an empty one. – FurryWombat Nov 25 '12 at 20:55
Just saying... are you sure that tag is actually in that page? That seems to be an AJAX request... – jacoz Nov 25 '12 at 21:01
