I'm trying to extract some information from a website.

There is a section that looks like this:

<th>Some text here</th><td>text to extract</td>

I would like to find (with a regexp or any other solution) the part that starts with "Some text here" and extract "text to extract" from it.

I tried the following regexp solution:

$reg = '/<th>Some text here<\/th><td>(.*)<\/td>/'; 
preg_match_all($reg, $content, $result, PREG_PATTERN_ORDER);

print_r($result);

but it gives me just an empty array:

Array ( [0] => Array ( ) [1] => Array ( ) )

How should I construct my regular expression to extract the wanted value? Or what other solution can I use to extract it?

Gacek
  • This works fine ... unable to reproduce your prob ... – Bobot Aug 05 '16 at 15:36
  • Can confirm @Bob0t it works fine. At least regex is correct for sure – Kovpaev Alexey Aug 05 '16 at 15:41
  • @mmm: this explanation has nothing to do with modern regex engines *(in particular the one used by PHP)*, it's about "regular expressions" in a computer science meaning. In short, the current question isn't a duplicate of this question since it speaks about something different *(the explanation becomes wrong if you try to apply it to the regex engines used in PHP, Perl, Ruby, .net ...)* – Casimir et Hippolyte Aug 05 '16 at 15:44
  • @CasimiretHippolyte still you shouldn't use regex for parsing html. php has its own DOM parser for this. – baao Aug 05 '16 at 15:48
  • Well, as I said in my question, I'm not sticking to the regexp solution. I just need to extract the value, no matter if I use regexp, dom crawler etc. – Gacek Aug 05 '16 at 15:56
  • It works fine for me too. Please see the sandbox: http://sandbox.onlinephpfunctions.com/code/d77a13f7cb5f3c605512c295e1ef16605020bc20 – Alan Aug 05 '16 at 17:17

2 Answers

Using XPath:

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($dom);

$content = $xp->evaluate('string(//th[.="Some text here"]/following-sibling::*[1][name()="td"])');

echo $content;

XPath query details:

string(  # return a string instead of a node list
    //   # anywhere in the DOM tree
    th   # a th node
    [.="Some text here"] # predicate: its content is "Some text here"
    /following-sibling::*[1] # first following sibling
    [name()="td"] # predicate: must be a td node
)
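
To see the query at work, here is a minimal, self-contained sketch. The table markup and the "No such label" lookup are made up for illustration and are not taken from the asker's page; the query itself is the one shown above.

```php
<?php
// Hypothetical markup: whitespace and a second row surround the target cells.
$html = '<table>
  <tr><th>Other label</th> <td>ignored</td></tr>
  <tr><th>Some text here</th>
      <td>text to extract</td></tr>
</table>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($dom);

// following-sibling::* selects element nodes only, so any whitespace
// text node between the th and the td is skipped.
$value = $xp->evaluate('string(//th[.="Some text here"]/following-sibling::*[1][name()="td"])');

// A label that does not exist yields an empty string rather than an error.
$missing = $xp->evaluate('string(//th[.="No such label"]/following-sibling::*[1][name()="td"])');

var_dump($value);
var_dump($missing);
```

Because string() turns an empty node list into an empty string, a caller can test the result with `=== ''` to detect a missing label.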

Your pattern probably fails because the td content contains newline characters, which the dot . does not match by default.
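
If the newline guess is right, the s (dotall) modifier should be enough to make the original pattern match. The sketch below assumes a made-up $content whose td spans two lines; it also swaps in a lazy .*? so the match stops at the first closing td, which is an addition and not part of the original pattern.

```php
<?php
// Hypothetical input: the td content spans two lines.
$content = "<th>Some text here</th><td>text to\nextract</td>";

// Original pattern: the dot does not match "\n", so nothing is found.
$withoutS = preg_match('/<th>Some text here<\/th><td>(.*)<\/td>/', $content, $m1);

// With the s modifier the dot also matches newlines;
// the lazy .*? stops at the first </td>.
$withS = preg_match('/<th>Some text here<\/th><td>(.*?)<\/td>/s', $content, $m2);

var_dump($withoutS);
var_dump($withS);
var_dump($m2[1]);
```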

Casimir et Hippolyte

You could use DOMDocument for this.

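$domd = new DOMDocument();
// loadHTML() is an instance method; @ suppresses warnings about malformed HTML
@$domd->loadHTML($content);
$extractedText = null;
foreach ($domd->getElementsByTagName("th") as $ele) {
    if ($ele->textContent !== 'Some text here') {
        continue;
    }
    // nextSibling is the td only when nothing (not even whitespace) sits between th and td
    if ($ele->nextSibling !== null) {
        $extractedText = $ele->nextSibling->textContent;
    }
    break;
}
if ($extractedText === null) {
    // extraction failed
} else {
    // extracted text is in $extractedText
}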

(Regex is generally a bad tool for parsing HTML, as someone in the comments has already pointed out.)

hanshenrik