Extract text from HTML

Question

Actors: example world

How can I extract this "example world" using a regular expression in PHP?

You should not use regular expressions to process HTML; better to use an HTML parser that can build the corresponding DOM. – Gumbo, Aug 01 '10 at 14:13

score 1 · Answer 1 · answered Aug 01 '10 at 13:49

preg_match('/<strong class="nfpd">Actors<\/strong>:([^<]+)<br \/>/', $text, $matches);

print_r($matches);
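
For reference, here is a minimal, self-contained sketch of how that pattern behaves. The sample `$text` is an assumption modeled on the markup quoted elsewhere in this thread, and `$actors` is just an illustrative variable name. The capture group keeps the spaces around the value, so the sketch trims them.

```php
<?php
// Assumed sample input, modeled on the markup discussed in this thread.
$text = '<strong class="nfpd">Actors</strong>: example world <br />';

// Same pattern as in the answer: group 1 captures everything
// between the colon and the following <br />.
$pattern = '/<strong class="nfpd">Actors<\/strong>:([^<]+)<br \/>/';

$actors = null;
if (preg_match($pattern, $text, $matches)) {
    // $matches[1] is " example world " (with spaces), so trim it.
    $actors = trim($matches[1]);
}

echo $actors, PHP_EOL; // example world
```

Note that the pattern only matches when the markup has exactly this shape; an extra attribute or a space before the colon makes `preg_match` return 0.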

I'd rather use something more clear: `~Actors:(?P[^<]+)
~im` – Mikulas Dite Aug 01 '10 at 14:31
`:(?` Well, that caught me off guard. Silly wordwrapping. – strager Aug 01 '10 at 14:33
You missed the main part: removing the need for escaping slashes and case-insensitive flag. – Mikulas Dite Aug 01 '10 at 15:03
@Mikulas Dite, I understand your revision. =] The sad face caught me off guard when I read your comment. – strager Aug 01 '10 at 15:07

score 1 · Answer 2 · edited May 23 '17 at 11:48

Like Gumbo already pointed out in the comments to this question, and as you have also been told in a number of your previous questions, regex isn't the right tool for parsing HTML.

The following uses DOM to get the first following text-node sibling of any strong element with a class attribute of nfpd. For the example HTML, this is the content of the text node, i.e. ": example world".

Example HTML:

$html = <<< HTML
<p>
    <strong class="nfpd">Actors</strong>: example world <br />
    something else
</p>
HTML;

And extraction with DOM:

libxml_use_internal_errors(TRUE);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);
libxml_clear_errors();

$nodes = $xPath->query('//strong[@class="nfpd"]/following-sibling::text()[1]');
foreach($nodes as $node) {
    echo $node->nodeValue; // : example world 
}

You can also do it without XPath, though it gets more verbose:

$nodes = $dom->getElementsByTagName('strong');
foreach($nodes as $node) {
    if($node->hasAttribute('class') &&
       $node->getAttribute('class') === 'nfpd' &&
       $node->nextSibling) {
        echo $node->nextSibling->nodeValue; // : example world 
    }
}

Removing the colon and whitespace is trivial: Use trim.
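
A minimal sketch of that last step, assuming the same example markup as above: the second argument to `trim` is the list of characters to strip from both ends. The `$values` array and the extra tab and newline characters in the list are illustrative additions, not part of the original answer.

```php
<?php
// Assumed input: the example markup from this answer, on one line.
$html = '<p><strong class="nfpd">Actors</strong>: example world <br />something else</p>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xPath = new DOMXPath($dom);
$values = [];
$query = '//strong[@class="nfpd"]/following-sibling::text()[1]';
foreach ($xPath->query($query) as $node) {
    // Strip colons, spaces, tabs and line breaks from both ends.
    $values[] = trim($node->nodeValue, ": \t\r\n");
}

print_r($values); // one element: "example world"
```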

answered Aug 01 '10 at 14:26

Gordon

OP asked for regular expression. Another solution would be fine, but this is way too slow compared to regex. Also he might not have the complete page, or even valid DOM. – Mikulas Dite Aug 01 '10 at 14:33
@Mikulas [It is widely accepted at SO that Regex is not the right tool for HTML parsing](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). Also, DOM doesn't care if you feed it valid HTML or if it is a full page. DOM is slower than Regex, but that's like saying a surgeon should use a chainsaw instead of a bonesaw to amputate, because the chainsaw is quicker. It's still the wrong tool. – Gordon Aug 01 '10 at 14:39
@Gordon Indeed it is, however here we have one line that also happens to be html. Also, the solution you provide does not remove the `: `. – Mikulas Dite Aug 01 '10 at 15:01
@Gordon: The same applies to anaesthesia. ;-) – Gumbo Aug 01 '10 at 15:02
@Mikulas I don't understand what you mean by *"however here we have one line that also happens to be html"*. The above will work regardless of the root element. Removing the colon (to stay in the surgeon analogy) is trivial. Just add `trim($node->nodeValue, ': ')`. The point of the example is not to spoonfeed the OP but to show how to use a proper DOM parser, which unfortunately is still something many programmers have no clue about (which is why they try to do it with Regex). There is at least one question a day asking how to manipulate HTML with Regex. – Gordon Aug 01 '10 at 15:21
@Mikulas Also, if you argue I cannot be sure whether I have valid HTML or a complete DOM, then how can you favor a Regex at all? The above Regex will fail unless the input contains the exact character sequence `Actors<\/strong>:`. If there is a second class, it fails. If there is an `id` attribute, it fails. If there is one more space anywhere in between, it fails. DOM doesn't care about these things. It will work with ` Actors <\/strong> :` – Gordon Aug 01 '10 at 15:40
@Mikulas In addition, if the OP is really only working with the shown string, e.g. it is not part of some larger HTML fragment, then the question should be: why use DOM or Regex at all, because you could match the desired string with strpos, substr or strstr as well and that should be faster than Regex. – Gordon Aug 01 '10 at 20:16
@Gordon I read only the last one: no, compare your solution (or strpos etc.) to a single one-liner, the regex. I'm sure you agree it's much simpler and shorter. – Mikulas Dite Aug 01 '10 at 20:52
@Mikulas no, I won't agree to that. Doing `trim(strstr(strip_tags($html), ':'), ': ');` or `str_replace('Actors: ','', strip_tags($html));` is simpler, shorter and faster than a Regex. There is absolutely no reason for using a Regex when the OP just has the given string. And if the OP has more than just the shown string, a Regex is not as reliable as DOM, as I have already outlined in my previous comments. – Gordon Aug 01 '10 at 22:32
