Parse HTML in PHP and extract value

Question

I'm trying to extract some information from a website.

There is a section that looks like this:

<th>Some text here</th><td>text to extract</td>

I would like to find (with a regexp or some other solution) the part starting with "Some text here" and extract the "text to extract" value from it.

I tried the following regexp solution:

$reg = '/<th>Some text here<\/th><td>(.*)<\/td>/'; 
preg_match_all($reg, $content, $result, PREG_PATTERN_ORDER);

print_r($result);

but it gives me just an empty array:

Array ( [0] => Array ( ) [1] => Array ( ) )

How should I construct my regular expression to extract the wanted value? Or what other solution can I use to extract it?

Can confirm @Bob0t it works fine. At least regex is correct for sure — Kovpaev Alexey, Aug 05 '16 at 15:41
@mmm: this explanation has nothing to do with modern regex engines *(in particular the one used by PHP)*, it's about "regular expressions" in a computer science meaning. In short, the current question isn't a duplicate of this question since it speaks about something different *(the explanation becomes wrong if you try to apply it to the regex engines used in PHP, Perl, Ruby, .net ...)* — Casimir et Hippolyte, Aug 05 '16 at 15:44
@CasimiretHippolyte still, you shouldn't use regex for parsing HTML. PHP has its own DOM parser for this. — baao, Aug 05 '16 at 15:48
Well, as I said in my question, I'm not sticking to the regexp solution. I just need to extract the value, no matter if I use regexp, dom crawler etc. — Gacek, Aug 05 '16 at 15:56
It works fine for me too. Please see the sandbox: http://sandbox.onlinephpfunctions.com/code/d77a13f7cb5f3c605512c295e1ef16605020bc20 — Alan, Aug 05 '16 at 17:17

Casimir et Hippolyte · Accepted Answer · 2016-08-05T18:21:01.700

Using XPath:

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($dom);

$content = $xp->evaluate('string(//th[.="Some text here"]/following-sibling::*[1][name()="td"])');

echo $content;

XPath query details:

string(  # return a string instead of a node list
    //   # anywhere in the DOM tree
    th   # a th node
    [.="Some text here"] # predicate: its content is "Some text here"
    /following-sibling::*[1] # first following sibling
    [name()="td"] # predicate: must be a td node
)
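
To see the query above in action, here is a minimal self-contained sketch. The sample markup and the variable names $value and $missing are made up for illustration; only the query itself comes from the answer. It also shows that string() gives back an empty string when no th matches, which is a convenient way to detect a failed extraction.

```php
<?php
// Hypothetical sample markup; the real page will differ.
$html = '<table><tr><th>Some text here</th><td>text to extract</td></tr>'
      . '<tr><th>Other</th><td>ignored</td></tr></table>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($dom);

// Same query as above: the td immediately following the matching th.
$value = $xp->evaluate('string(//th[.="Some text here"]/following-sibling::*[1][name()="td"])');

// No th has this content, so the node set is empty and string() yields "".
$missing = $xp->evaluate('string(//th[.="No such header"]/following-sibling::*[1][name()="td"])');

echo $value, PHP_EOL;
var_dump($missing === '');
```

Note that [.="Some text here"] is an exact comparison, so a th with extra surrounding whitespace would not match.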

Your pattern probably fails because the td content contains newline characters, which the dot . does not match by default.
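
If you want to stay with a regex anyway, a sketch of the fix is to add the s modifier so that the dot also matches newlines, and to make the quantifier lazy so the match stops at the first closing td. The sample string below is an assumption about what the page looks like (a td whose content spans two lines), not the real markup.

```php
<?php
// Hypothetical input: the td content contains a newline.
$content = "<table><tr><th>Some text here</th><td>text\nto extract</td></tr></table>";

// Without the s modifier the dot cannot cross the newline, so nothing matches.
$withoutS = preg_match('/<th>Some text here<\/th><td>(.*)<\/td>/', $content, $m1);

// With s the dot matches newlines too; .*? stops at the first </td>.
$withS = preg_match('/<th>Some text here<\/th><td>(.*?)<\/td>/s', $content, $m2);

var_dump($withoutS); // int(0)
var_dump($withS);    // int(1)
echo $m2[1], PHP_EOL;
```

This still breaks as soon as the tags carry attributes or there is whitespace between the th and the td, which is why the DOM approach is more robust.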

hanshenrik · Answer 2 · 2016-08-05T17:15:44.540

You could use DOMDocument for this.

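$domd = new DOMDocument();
// loadHTML() is an instance method; @ suppresses warnings about malformed HTML
@$domd->loadHTML($content);
$extractedText = NULL;
foreach ($domd->getElementsByTagName("th") as $ele) {
    if ($ele->textContent !== 'Some text here') { continue; }
    // nextSibling may be a whitespace text node, so advance to the next element
    $next = $ele->nextSibling;
    while ($next !== NULL && $next->nodeType !== XML_ELEMENT_NODE) {
        $next = $next->nextSibling;
    }
    if ($next !== NULL) { $extractedText = $next->textContent; }
    break;
}
if ($extractedText === NULL) {
    // extraction failed
} else {
    // extracted text is in $extractedText
}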

(Regex is generally a bad tool for parsing HTML, as someone in the comments has already pointed out.)
