Can't manage to write regex for this

Question

Possible Duplicate:
What RSS parser should I use in PHP?

Here is the code:

<item>
<title><![CDATA[OLK: The statement of shareholders for shares sale and for shares purchase]]></title>
<link>http://www.nasdaqomxbaltic.com/market/?pg=news&amp;news_id=250910</link>
<description><![CDATA[<pre></pre>]]></description>
<pubDate>2011-08-12 16:25:00</pubDate>
<guid>250910</guid>
</item>
<item>
<title><![CDATA[ZMP: Pranešimas apie sandorius susijusį su emitento vertybiniais popieriais]]></title>
<link>http://www.nasdaqomxbaltic.com/market/?pg=news&amp;news_id=250907</link>
<description><![CDATA[<pre></pre>]]></description>
<pubDate>2011-08-12 16:12:00</pubDate>
<guid>250907</guid>
</item>

I need to get the values OLK and ZMP, which sit between <title><![CDATA[ and the colon. What is the fastest and most efficient way to do this with regex in PHP? And why is CDATA there? Note: I'm getting the news_id= value too.

obligatory http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — austinbv, Aug 12 '11 at 13:48
@austin the accepted answer is wrong. Since all modern languages use PCRE, regex can very much parse HTML. Please do not link to the answer anymore. The only reason not to parse X(HT)ML with regex is that there are parsers readily available, and they are more robust and reliable for this purpose than brittle regex. — Gordon, Aug 12 '11 at 13:51
just out of curiosity, why not use an XML parser, that way you can get just the element you're looking for and not have the whole document to worry about. — Mike D, Aug 12 '11 at 13:52
@josh It is faster, better, and the right way to handle an XML document... — austinbv, Aug 12 '11 at 13:52

score 1 · Accepted Answer · answered Aug 12 '11 at 13:51

You should use an XML parser (e.g. SimpleXML) to access the tag content, and then use regular expressions on that content.

This is the most efficient solution, because:

an XML parser is the most efficient and reliable way to parse XML documents,
if you really need a regular expression, you should apply it to the data contained within the CDATA section.

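As for the part of your question about CDATA: a CDATA section marks text that the XML parser treats as literal character data rather than markup, so characters such as < and & need no escaping inside it.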
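
A minimal sketch of this approach, not a definitive implementation. It assumes the `<item>` elements sit inside a single root element (a hypothetical `<channel>` here, since the question shows only the items) and uses the two items from the question, trimmed to the fields needed. SimpleXML reads the title and link, a short regex takes the ticker before the first colon, and `parse_str` pulls `news_id` out of the link's query string:

```php
<?php
// Sample input: two items from the question, trimmed to the fields used here.
// The <channel> wrapper is an assumption; the question shows only the items.
$xml = <<<'XML'
<channel>
<item>
<title><![CDATA[OLK: The statement of shareholders for shares sale and for shares purchase]]></title>
<link>http://www.nasdaqomxbaltic.com/market/?pg=news&amp;news_id=250910</link>
</item>
<item>
<title><![CDATA[ZMP: Pranešimas apie sandorius susijusį su emitento vertybiniais popieriais]]></title>
<link>http://www.nasdaqomxbaltic.com/market/?pg=news&amp;news_id=250907</link>
</item>
</channel>
XML;

// LIBXML_NOCDATA makes the parser expose CDATA sections as plain text nodes.
$feed = simplexml_load_string($xml, 'SimpleXMLElement', LIBXML_NOCDATA);
if ($feed === false) {
    throw new RuntimeException('Could not parse the XML');
}

$results = [];
foreach ($feed->item as $item) {
    $title = (string) $item->title;
    $link  = (string) $item->link; // "&amp;" has already been decoded to "&"

    // Ticker: everything before the first colon of the title.
    $ticker = preg_match('/^([^:]+):/', $title, $m) ? $m[1] : null;

    // news_id: parse the query string instead of pattern-matching the URL.
    parse_str((string) parse_url($link, PHP_URL_QUERY), $query);
    $newsId = $query['news_id'] ?? null;

    $results[] = ['ticker' => $ticker, 'news_id' => $newsId];
}

print_r($results);
```

The regex here only ever sees the already-extracted title text, so it stays short and does not depend on how the feed happens to be formatted.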

score 0 · Answer 2 · answered Aug 12 '11 at 13:51

This is a great guide to parsing XML properly with PHP: http://www.kirupa.com/web/xml_php_parse_beginner.htm It is what I used when I started with PHP to figure out how the XML parser works.

austinbv

score 0 · Answer 3 · answered Aug 12 '11 at 13:51

Consider using an XML parser. CDATA allows you to use special characters inside the value. If you insist on using regex, try the following:

/<title><!\[CDATA\[([^:\]]+):/
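
A minimal sketch of applying a pattern of this kind to the raw feed text with `preg_match_all`, assuming the goal is the ticker before the colon (OLK, ZMP) plus the `news_id` from each link. The sample input is taken from the question and trimmed; the variable names are illustrative only:

```php
<?php
// Sample input: the titles and links of the two items from the question.
$xml = <<<'XML'
<item>
<title><![CDATA[OLK: The statement of shareholders for shares sale and for shares purchase]]></title>
<link>http://www.nasdaqomxbaltic.com/market/?pg=news&amp;news_id=250910</link>
</item>
<item>
<title><![CDATA[ZMP: Pranešimas apie sandorius susijusį su emitento vertybiniais popieriais]]></title>
<link>http://www.nasdaqomxbaltic.com/market/?pg=news&amp;news_id=250907</link>
</item>
XML;

// Capture the ticker between "<title><![CDATA[" and the first colon.
preg_match_all('/<title><!\[CDATA\[([^:\]]+):/', $xml, $titleMatches);
$tickers = $titleMatches[1];

// Capture the digits that follow "news_id=" in each link.
preg_match_all('/news_id=(\d+)/', $xml, $idMatches);
$newsIds = $idMatches[1];

// Pairing by position assumes every item has both a ticker and a news_id.
print_r(array_combine($tickers, $newsIds));
```

This works on the sample, but it breaks as soon as the feed changes whitespace, drops the CDATA wrapper, or contains an item without a colon in its title, which is the usual argument for parsing the XML first.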

Cem Kalyoncu

score 0 · Answer 4 · answered Aug 12 '11 at 13:52

If you really want to go with regex, then I would recommend zero-width look-ahead and look-behind assertions. They let you state an expression that must appear before and after the match without being included in the result.
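
A minimal sketch of what that could look like in PHP, using the two titles and links from the question as sample input. The patterns are one possible formulation, not the only one; note that PCRE look-behinds must have a fixed length, which the literal `<title><![CDATA[` prefix satisfies:

```php
<?php
// Sample input: the titles and links of the two items from the question.
$xml = <<<'XML'
<title><![CDATA[OLK: The statement of shareholders for shares sale and for shares purchase]]></title>
<link>http://www.nasdaqomxbaltic.com/market/?pg=news&amp;news_id=250910</link>
<title><![CDATA[ZMP: Pranešimas apie sandorius susijusį su emitento vertybiniais popieriais]]></title>
<link>http://www.nasdaqomxbaltic.com/market/?pg=news&amp;news_id=250907</link>
XML;

// (?<=...) look-behind: the match must be preceded by "<title><![CDATA[".
// (?=:)    look-ahead:  the match must be followed by a colon.
// Neither assertion consumes text, so $matches[0] holds only the ticker.
preg_match_all('/(?<=<title><!\[CDATA\[)[^:\]]+(?=:)/', $xml, $matches);
$tickers = $matches[0];

// Same idea for the id: digits that are preceded by "news_id=".
preg_match_all('/(?<=news_id=)\d+/', $xml, $idMatches);
$newsIds = $idMatches[0];

print_r($tickers);
print_r($newsIds);
```

Compared with a capture group, the assertions keep the whole match clean, so no group index is needed when reading the result.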

Gaijinhunter
