
Possible Duplicate:
regex help with getting tag content in PHP

First of all, please no comments about parsing HTML with regex. I know that it is not possible in general, but it should do the job in this case.

I am trying to get the content of <country lan="x">...</country> tags. There are no special cases like <country />, and the PHP DOM parser fails because the content of the tags contains many special characters (MediaWiki text).

So I have some text like

    <country lan="en">


    dsadasd


    {|,'''""" }}|]][][]//\\\\\2r2erfaf<>><<<#<div> --..,;;"!"§$%&/()=?`´´``***+~~~''

    0131ß

    ÄÜÖ#ax
    </country>

My solution at the moment is $pattern = <country lan=\"en\">(.|\t|\r|\n|\s)*<\/country>, which seems to match. I run it with

    preg_match_all($pattern, $content, $matches);
    print_r($matches);

but the printed result is just an empty array. How can I extract only the string between the <country lan="x">...</country> tags?

dnl
  • If I got it right, the OP cannot use DOM parsers because the HTML is invalid. – Álvaro González Nov 23 '12 at 09:44
  • If this is too complicated with a regex, why don't you just look for the first string, then for the second string, and take the substring between the two positions? Especially as the start and end are fixed strings. Just saying: DOM does not work for you, and it's also clear that regex is too complicated for you, too. So just do standard string manipulation instead. – hakre Nov 23 '12 at 10:00
  • I think the DOM parser does not do the trick because the content between the tags is a mix of wiki markup and HTML -- so it seems to be invalid. "Standard string manipulation" is quite a bit harder than using regex, because there can be several <country>...</country> tags per site. – dnl Nov 23 '12 at 10:06
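
The string-manipulation idea from the comments can also handle several tags per page by looping with an offset. The following is only a rough sketch: the function name extractCountryBlocks and the sample input are made up for illustration, and it assumes that the opening tag's attributes contain no ">" and that the tags are not nested.

```php
<?php
// Hypothetical helper (not from the thread): collects the text between
// every <country ...> opening tag and the next </country> closing tag.
function extractCountryBlocks(string $content): array
{
    $blocks = [];
    $offset = 0;
    $close  = '</country>';

    while (($start = strpos($content, '<country', $offset)) !== false) {
        // End of the opening tag, e.g. the ">" of <country lan="en">
        $tagEnd = strpos($content, '>', $start);
        if ($tagEnd === false) {
            break;
        }
        // First closing tag that follows the opening tag
        $end = strpos($content, $close, $tagEnd + 1);
        if ($end === false) {
            break;
        }
        $blocks[] = substr($content, $tagEnd + 1, $end - $tagEnd - 1);
        // Continue searching behind the closing tag
        $offset = $end + strlen($close);
    }

    return $blocks;
}

// Made-up sample input with two tags
$content = "<country lan=\"en\">\nfoo <div> ''bar''\n</country>\n<country lan=\"de\">ÄÜÖ</country>";
print_r(extractCountryBlocks($content));
```

Because strpos() and substr() work on bytes and never interpret the content, wiki markup and stray angle brackets between the tags do not matter here.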

1 Answer


Use this one:

    preg_match_all('/<country.*?>(.*?)<\/country>/s', $contents, $hits);
    print_r($hits);
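
To make the moving parts of that pattern visible, here is a small self-contained sketch. The sample input is shortened from the question and the second tag is an added assumption; $result is a name introduced only for this example. Compared with the pattern in the question, this one is wrapped in / delimiters, uses the s modifier so that . also matches newlines, and uses lazy quantifiers so that each match stops at the first closing tag.

```php
<?php
// Sample input modelled on the question; the second tag is an added assumption.
$contents = <<<'TXT'
<country lan="en">
dsadasd
{|,'''""" }}|]] <div> --..,;;
ÄÜÖ#ax
</country>
<country lan="de">zweiter Block</country>
TXT;

// "/" delimits the pattern, the trailing "s" modifier lets "." match
// newlines, and ".*?" is lazy, so it stops at the first </country>.
$result = preg_match_all('/<country.*?>(.*?)<\/country>/s', $contents, $hits);

if ($result === false) {
    // preg_match_all() returns false on error; preg_last_error() gives the code
    echo 'Regex error code: ' . preg_last_error() . PHP_EOL;
} else {
    // $hits[0] holds the full matches, $hits[1] only the captured content
    print_r($hits[1]);
}
```

If only the text between the tags is needed, $hits[1] is the array to read; $hits[0] still contains the surrounding tags.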
Nipun Tyagi