I'm trying to get the four or five things that happened on this day in history, and add a plaintext representation of that into an array in PHP.

So far, I'm using this code:

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://en.wikipedia.org/w/api.php?action=featuredfeed&feed=onthisday&feedformat=rss');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 3); // Timeout in seconds (integer, not a string)
curl_setopt($ch, CURLOPT_USERAGENT, 'My random user agent'); // Needed for Wikipedia to prevent IP blocking
$contents = trim(curl_exec($ch));
curl_close($ch);

$xml = simplexml_load_string($contents);
$json = json_encode($xml);
$array = json_decode($json, true);


$noOfDays = count($array['channel']['item']);
$r = $noOfDays - 1;
$input = $array['channel']['item'][$r]['description'];

I know this is not very dynamic or efficient, but one person is going to be calling this page once a day, so it's not terribly important.

At this point, $input contains a block of HTML, which looks something like this:

<p><b><a href="/wiki/April_6" title="April 6">April 6</a></b>: <b><a href="/wiki/Good_Friday" title="Good Friday">Good Friday</a></b> (Western Christianity, 2012); <b><a href="/wiki/Fast_of_the_Firstborn" title="Fast of the Firstborn">Fast of the Firstborn</a></b> begins at dawn and <b><a href="/wiki/Passover" title="Passover">Passover</a></b> begins at sunset (Judaism, 2012)
</p>
<div style="float:right;margin-left:0.5em">
<p><a href="/wiki/File:Sir_Arthur_Wellesley,_1st_Duke_of_Wellington.png" class="image" title="Arthur Wellesley, the Earl of Wellington"><img alt="Arthur Wellesley, the Earl of Wellington" src="//upload.wikimedia.org/wikipedia/commons/thumb/8/83/Sir_Arthur_Wellesley%2C_1st_Duke_of_Wellington.png/78px-Sir_Arthur_Wellesley%2C_1st_Duke_of_Wellington.png" width="78" height="100" /></a>
</p>
</div>
<li style="-moz-float-edge: content-box">
<a href="/wiki/1250" title="1250">1250</a> – <a href="/wiki/Seventh_Crusade" title="Seventh Crusade">Seventh Crusade</a>: Egyptian <a href="/wiki/Ayyubid" title="Ayyubid" class="mw-redirect">Ayyubids</a> <b><a href="/wiki/Battle_of_Fariskur" title="Battle of Fariskur">annihilated the crusader army</a></b> and captured King <a href="/wiki/Louis_IX_of_France" title="Louis IX of France">Louis&#160;IX of France</a> as a hostage.
<li style="-moz-float-edge: content-box">
<a href="/wiki/1320" title="1320">1320</a> – The <b><a href="/wiki/Declaration_of_Arbroath" title="Declaration of Arbroath">Declaration of Arbroath</a></b>, a declaration of <a href="/wiki/Scottish_independence" title="Scottish independence">Scottish independence</a>, was adopted.
<li style="-moz-float-edge: content-box">
<a href="/wiki/1812" title="1812">1812</a> – <a href="/wiki/Peninsular_War" title="Peninsular War">Peninsular War</a>: After a <b><a href="/wiki/Siege_of_Badajoz_(1812)" title="Siege of Badajoz (1812)">three-week siege</a></b>, the <a href="/wiki/Anglo-Portuguese_Army" title="Anglo-Portuguese Army">Anglo-Portuguese Army</a>, under the <a href="/wiki/Arthur_Wellesley,_1st_Duke_of_Wellington" title="Arthur Wellesley, 1st Duke of Wellington">Earl of Wellington</a> <i>(pictured)</i>, captured <a href="/wiki/Badajoz" title="Badajoz">Badajoz</a>, Spain and forced the surrender of the French garrison.
<li style="-moz-float-edge: content-box">
<a href="/wiki/1947" title="1947">1947</a> – The <a href="/wiki/1st_Tony_Awards" title="1st Tony Awards">first</a> <b><a href="/wiki/Tony_Award" title="Tony Award">Tony Awards</a></b>, recognizing achievement in live American <a href="/wiki/Theatre" title="Theatre">theatre</a>, were handed out at the <a href="/wiki/Waldorf-Astoria_Hotel" title="Waldorf-Astoria Hotel">Waldorf-Astoria Hotel</a> in <a href="/wiki/New_York_City" title="New York City">New York City</a>.
<li style="-moz-float-edge: content-box">
<a href="/wiki/2008" title="2008">2008</a> – Egyptian workers staged <b><a href="/wiki/2008_Egyptian_general_strike" title="2008 Egyptian general strike">an illegal general strike</a></b>, two days before <a href="/wiki/Egyptian_municipal_elections,_2008" title="Egyptian municipal elections, 2008">key municipal elections</a>.
</li>
</ul>
<p>More anniversaries: <span class="nowrap"><a href="/wiki/April_5" title="April 5">April 5</a> &#8211;</span> <span class="nowrap"><b><a href="/wiki/April_6" title="April 6">April 6</a></b> &#8211;</span> <span class="nowrap"><a href="/wiki/April_7" title="April 7">April 7</a></span>
</p>
<div style="text-align: right;" class="noprint"><span class="nowrap"><b><a href="/wiki/Wikipedia:Selected_anniversaries/April" title="Wikipedia:Selected anniversaries/April">Archive</a></b> &#8211;</span> <span class="nowrap"><b><a href="https://lists.wikimedia.org/mailman/listinfo/daily-article-l" class="extiw" title="mail:daily-article-l">By email</a></b> &#8211;</span> <span class="nowrap"><b><a href="/wiki/List_of_historical_anniversaries" title="List of historical anniversaries">List of historical anniversaries</a></b></span></div>
<div style="text-align: right;"><small>It is now <span class="nowrap">April 6, 2012</span> (<a href="/wiki/Coordinated_Universal_Time" title="Coordinated Universal Time">UTC</a>) &#8211; <span class="plainlinks" id="purgelink"><span class="nowrap"><a class="external text" href="//en.wikipedia.org/w/index.php?title=MediaWiki:Ffeed-onthisday-transcludeme&amp;action=purge">Refresh this page</a></span></span></small></div>

The only parts I'm interested in are the bits between each <li style="-moz-float-edge: content-box">

I've got no idea why they didn't close these <li> tags properly, but there you go.

So the essence of what I want to do is take the actual information, strip away the links and add each item to an array, which should look something like this:

Array (
    [0] => 1250 – Seventh Crusade: Egyptian Ayyubids annihilated the crusader army and captured King Louis&#160;IX of France as a hostage.
    [1] => Next one...
    [2] => And another...
)

There's also a slight problem with the &#160; in the first entry. How would I translate that into plaintext? I have a feeling HTML parsing may be the answer.

I've already tried regex and HTML parsing, but as the tags aren't closed I've had some difficulty getting either to work.

Any suggestions?

Alfo
  • 2
    Older HTML specs didn't require `<li>` tags to be closed. There was an implicit close any time a fresh `<li>` was encountered, exactly the same way that `<p>` didn't require a matching closer. – Marc B Apr 06 '12 at 15:35
  • 1
    @MarcB, HTML5 allows for [optional tags](http://dev.w3.org/html5/spec/syntax.html#optional-tags): ` Hello World!` is a valid HTML5 document. – zzzzBov Apr 06 '12 at 15:44
  • 6
    @Alfo: don't use regex for parsing HTML: http://stackoverflow.com/a/1732454/118068. Use DOM instead. It'll save you major headaches. – Marc B Apr 06 '12 at 15:46
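
For illustration, here is a minimal sketch of the DOM approach suggested in the comments. It is not a definitive implementation: the function name `extractListItems` is hypothetical, and it assumes libxml's HTML parser implicitly closes each unclosed `<li>` when the next one starts, which is how it normally behaves. The `<?xml encoding="UTF-8">` prefix is a commonly used workaround to make `loadHTML()` read the input as UTF-8, and replacing the non-breaking space with a regular space is one possible way to handle `&#160;`.

```php
<?php
// Hypothetical helper: return the plain text of every <li> in an HTML fragment.
function extractListItems(string $html): array
{
    $doc = new DOMDocument();

    // The unclosed <li> tags and the stray </ul> make libxml emit warnings;
    // collect them internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    // Assumption: the fragment is UTF-8. The XML declaration prefix is a
    // common workaround so that loadHTML() does not fall back to ISO-8859-1.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $items = [];
    foreach ($doc->getElementsByTagName('li') as $li) {
        // textContent drops all markup and decodes entities,
        // so &#160; arrives here as U+00A0 (non-breaking space).
        $text = str_replace("\u{00A0}", ' ', $li->textContent);

        // Collapse runs of whitespace (including newlines) and trim the ends.
        $items[] = trim(preg_replace('/\s+/u', ' ', $text));
    }

    return $items;
}
```

With the variables from the question, `$events = extractListItems($input);` should give one plain-text string per `<li>`, with the links stripped and `&#160;` turned into an ordinary space.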