
I read my own website with file_get_contents to display specific text. I display the data from interviews and I want to get the interview headline and the time to use on another site (link to the interview).

The relevant code block is in a table.

<td>
    Interview 1
    <small style="color:gray">
        Persons 2
        Cameras 2
    </small>
</td>
<td>
    1018 min
</td>

As you can see, Interview 1 is the headline and the time is 1018. I tried this on my own but somehow the pattern got a little crazy.

preg_match_all('#<td>\s*(.+?)\s*<small style="color:gray">\s*<\/small>\s*<\/td><td>\s*(.+?)\s*<\/td>#is', $mysite, $match)

I used \s* for the line breaks and spaces and (.+?) to match. What's wrong with my search pattern?

kilroy_2
  • You should look into PHP's [DOMDocument](http://php.net/manual/en/class.domdocument.php) instead. Using regex on HTML seldom works out as expected. – M. Eriksson Jun 18 '16 at 17:26
  • Generally it's not good to parse XML/HTML with regex. It can cause unexpected behavior, as you have noticed. – Andreas Jun 18 '16 at 17:28
  • Obligatory link to http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Shira Jun 18 '16 at 17:32

3 Answers


First, you should use a parser for this; regexes on HTML often behave unexpectedly. That said, there are two issues with your regex.

Issue one:

<small style="color:gray">\s*<\/small>

There is more than whitespace inside that element (the Persons 2 and Cameras 2 text), so \s* alone cannot match its content.

Issue two:

<\/td><td>

There is a newline between the closing </td> and the next <td>, which your pattern does not allow for.

So:

<td>\s*(.+?)\s*<small style="color:gray">.+?<\/small>\s*<\/td>\s<td>\s*(.+?)\s*<\/td>

should work for you (for this static example). If the small element's content is optional, change the `.+?` inside it to `.*?`. Note also that with a parser neither of these would have been an issue.
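
As a rough check, here is a minimal sketch that runs the corrected pattern against the sample markup from the question. `$mysite` stands in for the `file_get_contents` result, and the `\s*` between the two cells (instead of a single `\s`) is an assumption on my part to tolerate indentation; it is not required for the static example.

```php
<?php
// Sample markup copied from the question; in real use $mysite would be
// the string returned by file_get_contents().
$mysite = <<<'HTML'
<td>
    Interview 1
    <small style="color:gray">
        Persons 2
        Cameras 2
    </small>
</td>
<td>
    1018 min
</td>
HTML;

// "#" is the delimiter, so "/" needs no escaping. The "s" flag lets "."
// match newlines, and "i" makes the tag names case-insensitive.
$pattern = '#<td>\s*(.+?)\s*<small style="color:gray">.+?</small>\s*</td>\s*<td>\s*(.+?)\s*</td>#is';

// PREG_SET_ORDER groups the captures per match instead of per group.
preg_match_all($pattern, $mysite, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[1] is the headline, $m[2] is the time cell.
    echo $m[1], ' | ', $m[2], PHP_EOL;
}
```

With `PREG_SET_ORDER` each entry of `$matches` is one interview, which tends to be easier to loop over than the default layout.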

chris85
  • Is the HTML the same for all 3? – chris85 Jun 18 '16 at 17:51
  • Must be different, or your PHP usage of `$match` is incorrect. https://regex101.com/r/zN2eC6/1 <-regex demo... PHP demo -> https://3v4l.org/2k3v8 – chris85 Jun 18 '16 at 17:55
  • You changed the regex. `.+` != `.+?`. `.+` is greedy and consumes everything it can; http://www.rexegg.com/regex-quantifiers.html#greedytrap. (that page also is using `preg_match` not `preg_match_all`) – chris85 Jun 18 '16 at 18:02
  • Oh whoops. It works now, thanks for your help. I really appreciate it! – kilroy_2 Jun 18 '16 at 18:11
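
To illustrate the greedy versus lazy difference raised in the comments above, here is a small sketch. The two-row markup is hypothetical and simplified (no small element), purely to show how `.+` and `.+?` behave differently once there is more than one row.

```php
<?php
// Two hypothetical rows, only to demonstrate the quantifiers.
$html = '<td>Interview 1</td><td>1018 min</td>'
      . '<td>Interview 2</td><td>45 min</td>';

// Greedy: .+ consumes as much as it can and backtracks only to the
// last </td>, so all four cells collapse into a single match.
preg_match_all('#<td>(.+)</td>#s', $html, $greedy);

// Lazy: .+? stops at the first </td> it reaches, one match per cell.
preg_match_all('#<td>(.+?)</td>#s', $html, $lazy);

echo count($greedy[1]), PHP_EOL; // 1
echo count($lazy[1]), PHP_EOL;   // 4
print_r($lazy[1]);
```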

Here is a solution with DOMDocument:

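$doc = new DOMDocument();
$doc->loadHTML($html); // $html holds the page source, e.g. from file_get_contents()
$xpath = new DOMXPath($doc);
$match = [];
// Find every <small style="color:gray"> that is a direct child of a <td>.
foreach ($xpath->query('//td/small[@style="color:gray"]') as $small) {
    $td = $small->parentNode;
    // Skip whitespace text nodes until the next element sibling (the time cell).
    $td2 = $td->nextSibling;
    while ($td2 !== null && $td2->nodeType !== XML_ELEMENT_NODE) {
        $td2 = $td2->nextSibling;
    }
    if ($td2 === null) {
        continue; // no following cell, so there is no time to report
    }
    $match[] = ["headline" => trim($td->firstChild->textContent),
                "time" => trim($td2->textContent)];
}
print_r($match);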
trincot

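This works (note the lazy `.*?`; a greedy `.*` would swallow trailing whitespace and span several rows at once):

preg_match_all('#<td>\s*(.*?)\s*<small style="color:gray">.*?</small>\s*</td>\s*<td>\s*(.*?)\s*</td>#is', $mysite, $match);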