I have HTML like this:

  <ul id="video-tags">
    <li><em>Tagged: </em></li>
    <li><a href="/tags/sports">sports</a>, </li>
    <li><a href="/tags/entertain">entertain</a>, </li>
    <li><a href="/tags/funny">funny</a>, </li>
    <li><a href="/tags/comedy">comedy</a>, </li>
    <li><a href="/tags/automobile">automobile</a>, </li>
    <li>more <a href="/tags/"><strong>tags</strong></a>.</li>
  </ul>

How can I extract sports, entertain, funny, comedy, and automobile into a string?

My PHP preg_match_all call looks like this:

preg_match_all('/<a href\="\/tags\/(.*?)\">(.*?)<\/a>, <\/li>/', $this->page, $matches);
echo var_dump($matches);    
echo implode(' ', $tags);  

It does not work.

Kevin M
Redbox
  • How does it 'not work'? What are you getting? Errors? A different string than you expect? What IS it doing (or not doing)? What is `$tags` supposed to be, where is it set? – PenguinCoder Dec 25 '12 at 18:29
  • My var_dump output looks like this: array(3) { [0]=> array(0) { } [1]=> array(0) { } [2]=> array(0) { } } – Redbox Dec 25 '12 at 18:31
  • I'm expecting something like sports, entertain, funny, comedy, automobile, shown in an array or a string. – Redbox Dec 25 '12 at 18:31
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – shark555 Dec 25 '12 at 19:08

3 Answers

4

I'm not sure where you're getting $this->page from; however, the following should work as you expect:

http://ideone.com/KhWkEg

<?php
$page = 'subject string ...';

preg_match_all('/<a href\="\/tags\/(.*?)\">(.*?)<\/a>, <\/li>/', $page, $matches);

echo implode(', ', $matches[1]);  
?>

Substitute your $this->page for the $page variable, as long as it is still a string.

However, I'd suggest not trying to parse HTML with regular expressions. Instead, use a library like PHP's DOMDocument or Simple HTML DOM to parse the HTML properly.
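As a rough illustration of the DOM-based approach suggested above, here is a minimal sketch using DOMDocument with DOMXPath. It assumes the PHP DOM extension is available; the `$html` variable and the inline sample markup are stand-ins for whatever `$this->page` actually holds, and the XPath expression is just one way to skip the trailing "more tags" link.

```php
<?php
// Sample markup copied from the question; in real use this would be $this->page.
$html = <<<'HTML'
<ul id="video-tags">
  <li><em>Tagged: </em></li>
  <li><a href="/tags/sports">sports</a>, </li>
  <li><a href="/tags/entertain">entertain</a>, </li>
  <li><a href="/tags/funny">funny</a>, </li>
  <li><a href="/tags/comedy">comedy</a>, </li>
  <li><a href="/tags/automobile">automobile</a>, </li>
  <li>more <a href="/tags/"><strong>tags</strong></a>.</li>
</ul>
HTML;

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them;
// real-world HTML often triggers them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Select anchors in the list whose href has something after "/tags/".
// The "more tags" link has href exactly "/tags/" (6 characters), so it is skipped.
$nodes = $xpath->query(
    '//ul[@id="video-tags"]//a[starts-with(@href, "/tags/") and string-length(@href) > 6]'
);

$tags = [];
foreach ($nodes as $node) {
    $tags[] = trim($node->textContent);
}

echo implode(', ', $tags), "\n";
```

Unlike the regex, this does not depend on the exact whitespace or punctuation around each link.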

Community
PenguinCoder
2

This shorter regex also extracts the tag names:

preg_match_all('|tags/[^>]*>([^<]*)|', $str, $matches);
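One caveat worth checking with this pattern: it also matches the final "more tags" link, where the text right after the opening tag is empty, so `$matches[1]` ends with an empty string. The sketch below uses a shortened sample string (an assumption for illustration, not the full page) and filters the empty capture out.

```php
<?php
// Shortened sample input: two tag links plus the trailing "more tags" link.
$str = '<li><a href="/tags/sports">sports</a>, </li>'
     . '<li><a href="/tags/funny">funny</a>, </li>'
     . '<li>more <a href="/tags/"><strong>tags</strong></a>.</li>';

preg_match_all('|tags/[^>]*>([^<]*)|', $str, $matches);

// The last capture is '' because "<strong>" follows the anchor's ">" directly.
// Drop empty strings and reindex the array.
$tags = array_values(array_filter($matches[1], 'strlen'));

echo implode(' ', $tags), "\n";
```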

You can also use DOMDocument:

$d = new DOMDocument();
$d->loadHTML($str);
$as = $d->getElementsByTagName('a');
$result = array();
// Stop one short of the end to skip the final "more tags" link.
for ($i = 0; $i < $as->length - 1; $i++) {
    $result[] = $as->item($i)->textContent;
}

echo implode(' ', $result);
Shiplu Mokaddim
1

This worked perfectly for me:

preg_match_all('/<a href\="\/tags\/(.*?)\">.*?<\/a>, <\/li>/', $str, $matches);
echo implode(',', $matches[1]);

Prints: sports,entertain,funny,comedy,automobile

$this->page is probably empty, which is why you are not getting any data.

Why do you use two capture groups in the regexp? The same word appears in both the URL and the text of the link.
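To test the empty-input guess, it may help to validate the page content before matching and to check the return value of preg_match_all. The following is only a sketch: `extractTags` is a hypothetical helper name, and the pattern uses a single capture group on the href, relying on the observation that the URL and link text carry the same word.

```php
<?php
// Hypothetical helper (not part of the original code): returns the tag names,
// or throws if the input is unusable or the regex engine reports an error.
function extractTags($page)
{
    if (!is_string($page) || $page === '') {
        throw new InvalidArgumentException('Page content is empty or not a string.');
    }

    // One capture group: the part of the href after "/tags/".
    // [^"]+ needs at least one character, so the bare "/tags/" link is skipped.
    $count = preg_match_all('/<a href="\/tags\/([^"]+)">/', $page, $matches);
    if ($count === false) {
        throw new RuntimeException('preg_match_all failed, error code ' . preg_last_error());
    }

    return $matches[1];
}
```

Calling `echo implode(',', extractTags($this->page));` would then either print the tags or fail loudly when the page is empty, instead of silently producing empty arrays.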

Shiplu Mokaddim
user4035