Possible Duplicate:
What RSS parser should I use in PHP?

Here is the code:

<item>
<title><![CDATA[OLK: The statement of shareholders for shares sale and for shares purchase]]></title>
<link>http://www.nasdaqomxbaltic.com/market/?pg=news&amp;news_id=250910</link>
<description><![CDATA[<pre></pre>]]></description>
<pubDate>2011-08-12 16:25:00</pubDate>
<guid>250910</guid>
</item>
<item>
<title><![CDATA[ZMP: Pranešimas apie sandorius susijusį su emitento vertybiniais popieriais]]></title>
<link>http://www.nasdaqomxbaltic.com/market/?pg=news&amp;news_id=250907</link>
<description><![CDATA[<pre></pre>]]></description>
<pubDate>2011-08-12 16:12:00</pubDate>
<guid>250907</guid>
</item>

I need to extract the values OLK and ZMP, which sit between <title><![CDATA[ and the first colon. What is the fastest and most efficient way to do this with a PHP regex? And why is CDATA there? Note: I am also extracting the news_id= value from the link.

Josh
  • Do not use regex for this, use a proper XML parser – Pekka Aug 12 '11 at 13:48
  • Obligatory: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – austinbv Aug 12 '11 at 13:48
  • Is it faster, or is it just easier? I only need speed. – Josh Aug 12 '11 at 13:49
  • @austin the accepted answer is wrong. Since all modern languages use PCRE, regex can very much parse HTML. Please do not link to the answer anymore. The only reason not to parse X(HT)ML with regex is that there are parsers readily available, and they are more robust and reliable for this purpose than brittle regex. – Gordon Aug 12 '11 at 13:51
  • Just out of curiosity, why not use an XML parser? That way you can get just the element you're looking for and not have the whole document to worry about. – Mike D Aug 12 '11 at 13:52
  • @josh It is faster, better, and the right way to handle an XML document... – austinbv Aug 12 '11 at 13:52

4 Answers

You should use an XML parser (e.g. SimpleXML) to get at the tag content, and then apply a regular expression to that content.

This is the most efficient solution, because:

  • an XML parser is the most efficient and reliable way to read XML documents,
  • if you really need a regular expression, you should apply it to the text inside the CDATA section, not to the raw markup.

As for the CDATA part of your question: a CDATA section tells the parser to treat its contents as literal text rather than markup, so characters such as < and & need no escaping inside it.
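
The approach above can be sketched roughly as follows. This is only an illustration, not a definitive implementation: it assumes the feed is available as a string, and it wraps the two items from the question in a hypothetical <items> root element so the fragment is well-formed. The variable names and the shape of the $results array are assumptions.

```php
<?php
// Hypothetical sample: the items from the question, wrapped in an assumed
// <items> root so the fragment parses as well-formed XML.
$xml = <<<'XML'
<items>
<item>
<title><![CDATA[OLK: The statement of shareholders for shares sale and for shares purchase]]></title>
<link>http://www.nasdaqomxbaltic.com/market/?pg=news&amp;news_id=250910</link>
<guid>250910</guid>
</item>
<item>
<title><![CDATA[ZMP: Pranešimas apie sandorius susijusį su emitento vertybiniais popieriais]]></title>
<link>http://www.nasdaqomxbaltic.com/market/?pg=news&amp;news_id=250907</link>
<guid>250907</guid>
</item>
</items>
XML;

$feed = simplexml_load_string($xml);
$results = [];

foreach ($feed->item as $item) {
    // Casting to string yields the text inside the CDATA section.
    $title = (string) $item->title;
    // The parser has already decoded &amp; to & in the link text.
    $link = (string) $item->link;

    // The regex now runs on plain text: everything before the first colon.
    if (preg_match('/^([^:]+):/', $title, $m)) {
        // Read news_id from the query string instead of matching it by hand.
        parse_str((string) parse_url($link, PHP_URL_QUERY), $query);
        $results[] = [
            'ticker'  => $m[1],
            'news_id' => $query['news_id'] ?? null,
        ];
    }
}

print_r($results);
```

The parser handles the CDATA wrapper and the entity decoding, so the regular expression only has to deal with the title text itself.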

Tadeck

This is a great guide to parsing XML properly with PHP: http://www.kirupa.com/web/xml_php_parse_beginner.htm It is what I used when I started with PHP to figure out how the XML parser works.

austinbv

Consider using an XML parser. CDATA allows you to use special characters inside the value. If you insist on using a regex, try the following, which captures the text between the CDATA opener and the first colon:

/<title><!\[CDATA\[([^:\]]+):/
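
A minimal sketch of running a pattern of this kind over the raw feed with preg_match_all, with the ticker placed in the capture group so that every item is matched rather than only OLK. The $raw string is a hypothetical stand-in for the fetched feed, and the variable names are assumptions.

```php
<?php
// Hypothetical stand-in for the raw feed text: the two titles from the question.
$raw = '<title><![CDATA[OLK: The statement of shareholders for shares sale and for shares purchase]]></title>'
     . "\n"
     . '<title><![CDATA[ZMP: Pranešimas apie sandorius susijusį su emitento vertybiniais popieriais]]></title>';

// Group 1 holds everything between "<![CDATA[" and the first colon.
// Excluding "]" keeps the match from running past the end of a CDATA section.
$pattern = '/<title><!\[CDATA\[([^:\]]+):/';

preg_match_all($pattern, $raw, $matches);
$tickers = $matches[1];

print_r($tickers);
```

Note that this depends on the exact markup: a title written without CDATA, or with an attribute on <title>, would not match.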
Cem Kalyoncu

If you really want to go with regex, then I would recommend look-ahead and look-behind zero-width assertions. They let you state an expression that must appear at the start and at the end of the match without including it in the result.
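
A minimal sketch of what that could look like for the titles in the question. The pattern and variable names are assumptions, not taken from the answer. PCRE look-behind assertions must have a fixed length, which holds here because the opening text is a literal string.

```php
<?php
// Hypothetical stand-in for the raw feed text: the two titles from the question.
$raw = '<title><![CDATA[OLK: The statement of shareholders for shares sale and for shares purchase]]></title>'
     . "\n"
     . '<title><![CDATA[ZMP: Pranešimas apie sandorius susijusį su emitento vertybiniais popieriais]]></title>';

// (?<=...) requires the CDATA opener immediately before the match.
// (?=:)    requires a colon immediately after it.
// Neither assertion consumes text, so the whole match is just the ticker.
$pattern = '/(?<=<title><!\[CDATA\[)[^:\]]+(?=:)/';

preg_match_all($pattern, $raw, $matches);
$tickers = $matches[0];

print_r($tickers);
```

Because the assertions are zero-width, the result is read from $matches[0] and no capture group is needed.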

Gaijinhunter