preg_match_all reading sitesource multiple lines and matches

Question

I read my own website with file_get_contents to display specific text. I display the data from interviews and I want to get the interview headline and the time to use on another site (link to the interview).

The relevant code block is in a table.

<td>
    Interview 1
    <small style="color:gray">
        Persons 2
        Cameras 2
    </small>
</td>
<td>
    1018 min
</td>

As you can see, Interview 1 is the headline and the time is 1018. I tried this on my own but somehow the pattern got a little crazy.

preg_match_all('#<td>\s*(.+?)\s*<small style="color:gray">\s*<\/small>\s*<\/td><td>\s*(.+?)\s*<\/td>#is', $mysite, $match)

I used \s* for the line breaks and spaces and (.+?) to match. What's wrong with my search pattern?

You should look into PHP's [DomDocument](http://php.net/manual/en/class.domdocument.php) instead. Using regex on HTML seldom works out as expected. – M. Eriksson, Jun 18 '16 at 17:26
Generally it's not good to parse XML/HTML with regex. It can cause unexpected behavior, as you have noticed. – Andreas, Jun 18 '16 at 17:28
Obligatory link to http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Shira, Jun 18 '16 at 17:32

score 1 · Accepted Answer · answered Jun 18 '16 at 17:38

First, you should use a parser for this; regexes on HTML tend to behave unexpectedly. There are two issues with your regex, though.

Issue one:

<small style="color:gray">\s*<\/small>

There isn't just whitespace inside that element: it contains text (Persons 2, Cameras 2), so \s* alone cannot match its content.

Issue two:

<\/td><td>

There is a newline between the closing </td> and the next <td>, and your pattern allows nothing between them.

So:

<td>\s*(.+?)\s*<small style="color:gray">.+?<\/small>\s*<\/td>\s<td>\s*(.+?)\s*<\/td>

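should work for you (for this static example). If the small element's content is optional, change the + in .+? to a *. If more than one whitespace character can separate the two cells (indentation, or \r\n line endings), use \s* there instead of the single \s. Note also that with a parser these wouldn't have been issues.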
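As a minimal sketch of how the corrected pattern behaves, the snippet below runs it against the sample markup from the question. The hard-coded `$mysite` string is an assumption standing in for the `file_get_contents` result, and `\s*` is used between the two cells instead of the single `\s`.

```php
<?php
// Sample markup copied from the question; in real use this would be
// the string returned by file_get_contents() (assumption for this sketch).
$mysite = <<<'HTML'
<td>
    Interview 1
    <small style="color:gray">
        Persons 2
        Cameras 2
    </small>
</td>
<td>
    1018 min
</td>
HTML;

// Lazy quantifiers (.+?) keep each capture as short as possible.
// The "s" flag lets "." cross line breaks; "i" makes the match case-insensitive.
$pattern = '#<td>\s*(.+?)\s*<small style="color:gray">.+?<\/small>\s*<\/td>\s*<td>\s*(.+?)\s*<\/td>#is';

// PREG_SET_ORDER groups the captures per table row instead of per capture group.
preg_match_all($pattern, $mysite, $match, PREG_SET_ORDER);

foreach ($match as $row) {
    $headline = $row[1];        // text before the <small> element
    $minutes  = (int) $row[2];  // leading digits of the time cell, e.g. "1018 min"
    echo $headline, ' | ', $minutes, PHP_EOL;
}
```

With `PREG_SET_ORDER`, each entry of `$match` holds one row, which tends to be easier to loop over than the default layout where `$match[1]` holds all headlines and `$match[2]` all times.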

answered by chris85

Is the HTML the same for all 3? – chris85 Jun 18 '16 at 17:51
Must be different, or your PHP usage of `$match` is incorrect. Regex demo: https://regex101.com/r/zN2eC6/1 and PHP demo: https://3v4l.org/2k3v8 – chris85 Jun 18 '16 at 17:55
You changed the regex. `.+` != `.+?`. `.+` is greedy and consumes everything it can; see http://www.rexegg.com/regex-quantifiers.html#greedytrap. (That page is also using `preg_match`, not `preg_match_all`.) – chris85 Jun 18 '16 at 18:02
Oh whoops. It works now, thanks for your help. I really appreciate it! – kilroy_2 Jun 18 '16 at 18:11
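The greedy trap mentioned in the comments can be reproduced with a short sketch. The two-row markup below is hypothetical (the second row is invented for illustration); the only difference between the two patterns is `.+?` versus `.+`.

```php
<?php
// Two rows in the same layout as the question's sample.
// The second row is a made-up example for this sketch.
$mysite = <<<'HTML'
<td>
    Interview 1
    <small style="color:gray">Persons 2 Cameras 2</small>
</td>
<td>
    1018 min
</td>
<td>
    Interview 2
    <small style="color:gray">Persons 1 Cameras 3</small>
</td>
<td>
    45 min
</td>
HTML;

// Lazy: each .+? stops at the first place the rest of the pattern can match.
$lazy   = '#<td>\s*(.+?)\s*<small style="color:gray">.+?</small>\s*</td>\s*<td>\s*(.+?)\s*</td>#is';
// Greedy: each .+ runs as far as it can, so it swallows the rows in between.
$greedy = '#<td>\s*(.+)\s*<small style="color:gray">.+</small>\s*</td>\s*<td>\s*(.+)\s*</td>#is';

// preg_match_all() returns the number of full pattern matches.
$lazyCount   = preg_match_all($lazy, $mysite, $lazyMatch, PREG_SET_ORDER);
$greedyCount = preg_match_all($greedy, $mysite, $greedyMatch, PREG_SET_ORDER);

echo "lazy matches: $lazyCount", PHP_EOL;
echo "greedy matches: $greedyCount", PHP_EOL;
```

The lazy pattern finds one match per row, while the greedy one collapses both rows into a single match whose first capture runs from the first headline through the second.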

score 0 · Answer 2 · answered Jun 18 '16 at 18:41

Here is a solution with DOMDocument:

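// $html holds the page source (e.g. the result of file_get_contents()).
libxml_use_internal_errors(true); // keep warnings about imperfect HTML out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$match = [];
// Each gray <small> element sits inside the headline cell.
foreach ($xpath->query('//td/small[@style="color:gray"]') as $small) {
    $td2 = $td = $small->parentNode;
    // Skip whitespace text nodes until the next element sibling (the time cell).
    do {
        $td2 = $td2->nextSibling;
    } while ($td2 !== null && $td2->nodeType !== XML_ELEMENT_NODE);
    if ($td2 === null) {
        continue; // no cell follows the headline cell
    }
    $match[] = ["headline" => trim($td->firstChild->textContent),
                "time" => trim($td2->textContent)];
}
print_r($match);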
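As an alternative sketch, the sibling walk in the loop above can be pushed into XPath with the `following-sibling` axis. This is a variation, not the answer's own code; the `$html` string, which wraps the question's cells in a minimal table, and the `$rows` variable name are assumptions for illustration.

```php
<?php
// Assumed markup: the question's two cells wrapped in a minimal table.
$html = '<table><tr>
<td>
    Interview 1
    <small style="color:gray">Persons 2 Cameras 2</small>
</td>
<td>
    1018 min
</td>
</tr></table>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$rows = [];
// Select every <td> that contains the gray <small> element (the headline cell).
foreach ($xpath->query('//td[small[@style="color:gray"]]') as $td) {
    // First <td> element after the headline cell; the axis only yields
    // elements named td, so whitespace text nodes need no manual skipping.
    $timeCell = $xpath->query('following-sibling::td[1]', $td)->item(0);
    if ($timeCell === null) {
        continue; // headline cell without a following time cell
    }
    $rows[] = [
        'headline' => trim($td->firstChild->textContent),
        'time'     => trim($timeCell->textContent),
    ];
}
print_r($rows);
```

Passing `$td` as the second argument to `query()` makes the expression relative to that cell, so each headline is paired with its own neighbour rather than with the first time cell in the document.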

score 0 · Answer 3 · answered Jun 18 '16 at 18:59

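This works, provided the quantifiers are lazy (.*?). With greedy .* and the s flag, a page containing several rows would be consumed as a single match:

preg_match_all('#<td>\s*(.*?)\s*<small style="color:gray">.*?</small>\s*</td>\s*<td>\s*(.*?)\s*</td>#is', $mysite, $match);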

answered by Alejandro Salamanca Mazuelo
