PHP: preg_match_all to extract HTML into a string

Question

I have HTML like this:

  <ul id="video-tags">
            <li><em>Tagged: </em></li>
                    <li><a href="/tags/sports">sports</a>, </li>
                            <li><a href="/tags/entertain">entertain</a>, </li>
                            <li><a href="/tags/funny">funny</a>, </li>
                            <li><a href="/tags/comedy">comedy</a>, </li>
                            <li><a href="/tags/automobile">automobile</a>, </li>
                    <li>more <a href="/tags/"><strong>tags</strong></a>.</li>
  </ul>

How can I extract sports, entertain, funny, comedy and automobile into a string?

My PHP preg_match_all call looks like this:

preg_match_all('/<a href\="\/tags\/(.*?)\">(.*?)<\/a>, <\/li>/', $this->page, $matches);
echo var_dump($matches);    
echo implode(' ', $tags);

It does not work.

How does it 'not work'? What are you getting? Errors? A different string than you expect? What IS it doing (or not doing)? What is `$tags` supposed to be, where is it set? — PenguinCoder, Dec 25 '12 at 18:29
my var_dump look like this: array(3) { [0]=> array(0) { } [1]=> array(0) { } [2]=> array(0) { } } — Redbox, Dec 25 '12 at 18:31
im expecting something like: sports, entertain, funny, comedy, automobile showed inside array or string — Redbox, Dec 25 '12 at 18:31
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — shark555, Dec 25 '12 at 19:08

score 4 · Accepted Answer · edited May 23 '17 at 11:43

I'm not sure where you're getting $this->page from, but the following should work as you expect:

http://ideone.com/KhWkEg

<?php
$page = 'subject string ...';

preg_match_all('/<a href\="\/tags\/(.*?)\">(.*?)<\/a>, <\/li>/', $page, $matches);

echo implode(', ', $matches[1]);  
?>

Replace the $page variable with your $this->page, provided it is still a string.

However, I'd suggest not trying to parse HTML with regular expressions. Instead, use a library such as PHP's DOMDocument or Simple HTML DOM to parse the HTML properly.
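As a rough illustration of the parser-based approach, here is a minimal sketch using DOMDocument together with DOMXPath. The sample markup is copied from the question; the variable names ($html, $tags, $tagString) and the XPath expression are assumptions chosen for this example, not something the original poster used. The expression keeps only links inside ul#video-tags whose href is longer than the bare "/tags/" prefix, which excludes the trailing "more tags" link.

```php
<?php
// Sample markup from the question; in the asker's code this would be $this->page.
$html = <<<'HTML'
<ul id="video-tags">
  <li><em>Tagged: </em></li>
  <li><a href="/tags/sports">sports</a>, </li>
  <li><a href="/tags/entertain">entertain</a>, </li>
  <li><a href="/tags/funny">funny</a>, </li>
  <li><a href="/tags/comedy">comedy</a>, </li>
  <li><a href="/tags/automobile">automobile</a>, </li>
  <li>more <a href="/tags/"><strong>tags</strong></a>.</li>
</ul>
HTML;

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// "/tags/" is 6 characters long, so string-length(@href) > 6 skips the bare "/tags/" link.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query(
    '//ul[@id="video-tags"]//a[starts-with(@href, "/tags/") and string-length(@href) > 6]'
);

$tags = array();
foreach ($nodes as $node) {
    $tags[] = trim($node->textContent);
}

$tagString = implode(', ', $tags);
echo $tagString, "\n";
```

Selecting by the list's id and the href prefix means the result does not depend on the exact whitespace or the ", </li>" suffix that the regex relies on.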

score 2 · Answer 2 · edited Dec 26 '12 at 02:42

This shorter regex does the same thing:

preg_match_all('|tags/[^>]*>([^<]*)|', $str, $matches);
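One thing to watch with the pattern above: because ([^<]*) is allowed to match nothing, the final "more tags" link, where the anchor is followed directly by a strong element, contributes an empty capture. The sketch below runs the pattern against the question's markup and compares it with a variant that uses + instead of *. The variant and the variable names $loose and $strict are assumptions added for illustration, not part of the original answer.

```php
<?php
// Sample markup from the question; $str is the variable name used in the answer.
$str = <<<'HTML'
<ul id="video-tags">
  <li><em>Tagged: </em></li>
  <li><a href="/tags/sports">sports</a>, </li>
  <li><a href="/tags/entertain">entertain</a>, </li>
  <li><a href="/tags/funny">funny</a>, </li>
  <li><a href="/tags/comedy">comedy</a>, </li>
  <li><a href="/tags/automobile">automobile</a>, </li>
  <li>more <a href="/tags/"><strong>tags</strong></a>.</li>
</ul>
HTML;

// Pattern as posted: [^<]* may match zero characters, so the last link
// (href="/tags/" followed by <strong>) produces an empty string in group 1.
preg_match_all('|tags/[^>]*>([^<]*)|', $str, $loose);

// Hypothetical variant: [^<]+ requires at least one character of link text,
// so the "more tags" link is not matched at all.
preg_match_all('|tags/[^>]*>([^<]+)|', $str, $strict);

echo 'captures with *: ', count($loose[1]), "\n";
echo 'captures with +: ', count($strict[1]), "\n";
echo implode(', ', $strict[1]), "\n";
```

If the original pattern has to stay as it is, passing $matches[1] through array_filter('strlen') style filtering or dropping the last element would be another way to discard the empty entry.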

You can also use DOMDocument:

$d = new DOMDocument();
$d->loadHTML($str);
$as = $d->getElementsByTagName('a');
$result = array();
// Stop one short of the end to skip the final "more tags" link.
for ($i = 0; $i < $as->length - 1; $i++) {
    $result[] = $as->item($i)->textContent;
}

echo implode(' ', $result);

edited Dec 26 '12 at 02:42

answered Dec 25 '12 at 18:44

Shiplu Mokaddim


score 1 · Answer 3 · edited Dec 25 '12 at 18:41


This worked perfectly for me:

preg_match_all('/<a href\="\/tags\/(.*?)\">.*?<\/a>, <\/li>/', $str, $matches);
echo implode(',', $matches[1]);

Prints: sports,entertain,funny,comedy,automobile

$this->page is probably empty, which is why you are not getting any data.

Why do you use two capture groups in the regexp? The link URL and the link text contain the same words.
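To tell an empty input apart from a pattern that simply does not match, it can help to check the subject string and the return value of preg_match_all before using $matches. The following is a minimal sketch under that assumption; extractTags is a hypothetical helper name, and the pattern is a simplified single-group variant that only looks at the href.

```php
<?php
/**
 * Hypothetical helper: returns the tag slugs found in $page.
 * Throws if the input is not a non-empty string or if the regex engine fails.
 */
function extractTags($page)
{
    if (!is_string($page) || $page === '') {
        throw new InvalidArgumentException('Page content is empty or not a string.');
    }

    // [^"]+ needs at least one character, so the bare "/tags/" link is skipped.
    $count = preg_match_all('/<a href="\/tags\/([^"]+)">/', $page, $matches);

    // preg_match_all returns false on failure, and 0 when nothing matched.
    if ($count === false) {
        throw new RuntimeException('preg_match_all failed.');
    }

    return $matches[1];
}

// Usage with a fragment of the question's markup.
$sample = '<li><a href="/tags/sports">sports</a>, </li>'
        . '<li><a href="/tags/funny">funny</a>, </li>'
        . '<li>more <a href="/tags/"><strong>tags</strong></a>.</li>';

$found = extractTags($sample);
echo implode(',', $found), "\n";

try {
    extractTags('');
    $emptyRejected = false;
} catch (InvalidArgumentException $e) {
    // An empty page is reported explicitly instead of yielding empty arrays.
    $emptyRejected = true;
}
```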

edited Dec 25 '12 at 18:41

Shiplu Mokaddim


answered Dec 25 '12 at 18:30

user4035

