1

The feed in question is: http://api.inoads.com/snowstorm/feed.xml

Here is the PHP code I am using for the generation:

<?php

$database =  'xxxx';
$dbconnect = mysql_pconnect('xxxx', 'xxxx', 'xxxx');
mysql_select_db($database, $dbconnect);

$query = "SELECT * FROM the_queue WHERE id LIKE '%'    ORDER BY id DESC LIMIT 25";
$result = mysql_query($query, $dbconnect);

while ($line = mysql_fetch_assoc($result))
        {
            $return[] = $line;
        }

$now = date("D, d M Y H:i:s T");

$output = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>
            <rss version=\"2.0\">
                <channel>
                    <title>The Queue</title>
                    <link>http://readapp.net</link>
                    <description>A curated reading list.</description>
                    <language>en-us</language>
                    <pubDate>$now</pubDate>
                    <lastBuildDate>$now</lastBuildDate>
            ";

foreach ($return as $line)
{
    $output .= "<item><title>".htmlspecialchars($line['title'])."</title>
    <description>".htmlspecialchars($line['description'])."</description>
                    <link>".htmlspecialchars($line['link'])."</link>
                    <pubDate>".htmlspecialchars($line['pubDate'])."</pubDate>
                </item>";
}
$output .= "</channel></rss>";

$fh = fopen('feed.xml', 'w');
fwrite($fh, $output);
?>

What might be causing the error?

Here's a link from the feed validator: http://validator.w3.org/feed/check.cgi?url=http%3A%2F%2Fapi.inoads.com%2Fsnowstorm%2Ffeed.xml

Bart
mmackh
  • Welcome to Stack Overflow! You are not doing any error checking in your queries. You *need* to do that after a `mysql_query()` call. Otherwise, your script will break if the query fails. How to do this is outlined in the [manual on `mysql_query()`](http://php.net/mysql_query) or in this [reference question.](http://stackoverflow.com/questions/6198104/reference-what-is-a-perfect-code-sample-using-the-mysql-extension) – Pekka Dec 10 '11 at 09:24
  • What string encoding is your data in? You need to specify that in the `<?xml ?>` tag. For example `<?xml version="1.0" encoding="UTF-8"?>` – Abhi Beckert Dec 10 '11 at 09:26
  • @AbhiBeckert UTF-8 - I've revised the post above to reflect this – mmackh Dec 10 '11 at 09:41
  • @deceze There are issues with quotes and question marks - I've updated the post to show this. – mmackh Dec 10 '11 at 09:43
  • The mysql extension is outdated and on its way to deprecation. New code should use mysqli or PDO, both of which have important advantages, such as support for prepared statements. – outis Dec 10 '11 at 09:46
  • @outis Could you please point me to a sample /some documentation? – mmackh Dec 10 '11 at 09:51
  • @mmackh: for [mysqli](http://php.net/mysqli) and [PDO](http://php.net/PDO)? They're standard PHP extensions, and covered in the manual. If you want a tutorial, try ["Writing MySQL Scripts with PHP and PDO"](http://www.kitebird.com/articles/php-pdo.html). – outis Dec 10 '11 at 10:06
  • @outis while it's true everyone should start moving towards the modern APIs, the old ones are perfectly functional and will not have anything to do with the character encoding issue described here. *If it ain't broke, don't fix it.* – Abhi Beckert Dec 10 '11 at 19:57
  • @AbhiBeckert: how I hate that phrase. Anything that's [obsolete](http://stackoverflow.com/q/763492/90527) still works, but has been replaced with something that works better. The obsolete thing can be said to be functional, limping along; if it unnecessarily impacts development, it's broken. There's a cost to updating a system, so in some situations it may make more financial sense to stick to the old tool. However, this is new code, not legacy. There are tangible benefits to using mysqli or PDO over mysql. – outis Feb 14 '12 at 23:53

5 Answers

3

You said the XML file is UTF-8, but when I download it and open it in my text editor, it auto-detects the Windows Latin-1 (Windows-1252) encoding, and the quotes display perfectly.

If I force my text editor to use UTF-8, it shows an error message because there are illegal characters for the UTF-8 encoding.

Therefore, your data is not UTF-8, it is latin1. You need to find out exactly where that's happening. It could be any one, or several of:

Is the HTML page where the content is typed in by the user set to UTF-8?

If not, the browser will be sending latin1 quotes. To fix this, the first tag in your <head> needs to be:

<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  ...
</head>

Is every browser correctly respecting your UTF-8 setting in that page's HTML?

If you specify UTF-8 and the page contains characters illegal in that encoding, some browsers might decide to use a different encoding despite the <meta> tag. How to check this is different in every browser.

Is the MySQL connection when inserting into the database set to use UTF-8?

You need to be using UTF-8 here, or else MySQL may try to convert the encoding for you, often corrupting the characters. Set the encoding with:

$database =  'xxxx';
$dbconnect = mysql_pconnect('xxxx', 'xxxx', 'xxxx');
mysql_select_db($database, $dbconnect);
mysql_query('SET NAMES utf8', $dbconnect);

Is the MySQL table (and individual column) set to use UTF-8?

Again, to avoid MySQL doing its own buggy conversion, you need to make sure it's using UTF-8 for the table and also for the individual column. Do a structure dump of the database and check for:

CREATE TABLE `the_queue` (
  ...
) ... DEFAULT CHARSET=utf8;

And also make sure there isn't something like this on any of the columns:

`description` varchar(255) CHARACTER SET latin1,

Is the MySQL connection when reading the database set to use UTF-8?

Your read connection also needs to be utf8. So double check that.
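
For the read side, here is a minimal sketch using PDO rather than the mysql_* functions in the question (that swap is my assumption; those functions were removed in PHP 7). The helper name `utf8Dsn` is hypothetical, the credentials are the same placeholders as in the question, and `utf8mb4` is assumed to be available on the server; plain `utf8` would match the `SET NAMES utf8` call shown earlier.

```php
<?php
// Hypothetical helper: builds a PDO DSN that requests a UTF-8 connection,
// so no separate SET NAMES query is needed after connecting.
function utf8Dsn(string $host, string $dbname): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbname);
}

// Usage with placeholder credentials (not executed here):
// $pdo = new PDO(utf8Dsn('xxxx', 'xxxx'), 'xxxx', 'xxxx', [
//     PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION, // fail loudly on bad queries
// ]);
// $rows = $pdo->query('SELECT * FROM the_queue ORDER BY id DESC LIMIT 25')
//             ->fetchAll(PDO::FETCH_ASSOC);
```

Putting the charset in the DSN keeps the write and read connections consistent, as long as both are built through the same helper.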

Are you doing anything in the PHP that cannot handle UTF-8?

PHP has some functions that corrupt UTF-8 strings if used carelessly. One of them is htmlentities(): before PHP 5.4 it assumes Latin-1 unless you pass a charset argument, so make sure you always use htmlspecialchars(). The easiest way to test this is to start commenting out big chunks of your code to see where the encoding is breaking.
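
To narrow down where the encoding breaks, it may help to check strings at each stage. The following is only a sketch: the helper name `toUtf8` is hypothetical, it assumes the mbstring extension is enabled, and it assumes that any input that is not valid UTF-8 is really Windows-1252 (a guess based on the 0x94 byte, which is ” in that encoding).

```php
<?php
// Hypothetical helper: returns $text as UTF-8.
// Assumption: anything that is not valid UTF-8 is Windows-1252.
function toUtf8(string $text): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
}

$broken = "Stop, Dave?\x94"; // ends with a lone Windows-1252 closing quote

var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)
echo bin2hex(toUtf8($broken)), "\n";           // hex ends in e2809d, the UTF-8 bytes for ”
```

Converting on output like this hides the symptom; fixing the page, connection, and table settings listed above is still the real repair.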

Abhi Beckert
1

The point of htmlentities is to replace all characters that have defined HTML character entities with those entities. If you really don't want any character entities (as your desired result suggests), don't use htmlentities.

Before PHP 5.4, htmlentities defaults to the Latin-1 charset, so it chokes on the smart quotes (indeed, all multibyte characters), which is where you see the question marks. One fix is to use htmlspecialchars to convert a much more limited set of characters (&, <, >, " and, depending on the flags, '). This will still convert the double quotes because, well, that's the point of htmlspecialchars, unless you specify ENT_NOQUOTES as the second argument. Another fix is to specify the character set as the third argument (this isn't exclusive of using htmlspecialchars).

The fourth argument to either function specifies whether or not to encode already encoded characters. Whether or not to double-encode depends on the source data.

$line['description'] = '"Dave, stop. Stop, will you? Stop, Dave. Will you stop, Dave?” ... “Dave, my mind is going,” HAL says, forlornly. “I can feel it. I can feel it.”';

echo "<description>" . htmlspecialchars($line['description'], ENT_NOQUOTES, 'UTF-8', false) . "</description>";
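
As a rough illustration of the fourth argument, here is a sketch with a hypothetical wrapper named `xmlText` and made-up sample text. It also shows one behaviour worth knowing: when the input is not valid in the stated charset and neither ENT_SUBSTITUTE nor ENT_IGNORE is set, htmlspecialchars returns an empty string.

```php
<?php
// Hypothetical wrapper for escaping text placed inside an RSS element.
function xmlText(string $text, bool $doubleEncode = true): string
{
    return htmlspecialchars($text, ENT_NOQUOTES, 'UTF-8', $doubleEncode);
}

$stored = 'Tom &amp; Jerry say "hi" <b>';

echo xmlText($stored), "\n";        // Tom &amp;amp; Jerry say "hi" &lt;b&gt;
echo xmlText($stored, false), "\n"; // Tom &amp; Jerry say "hi" &lt;b&gt;

// A lone 0x94 byte is not valid UTF-8, so the whole result is empty.
var_dump(xmlText("Dave?\x94"));     // string(0) ""
```

If the stored data was never entity-encoded, leave double encoding on; turn it off only when the source already contains entities.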

outis
  • I have tried this, but I'm getting the following error: This feed does not validate. 'utf8' codec can't decode byte 0x94 in position 606: unexpected code byte (maybe a high-bit character?) – mmackh Dec 10 '11 at 10:01
  • I receive the following warning: expects parameter 2 to be long – mmackh Dec 10 '11 at 10:09
  • What would be the reason if this code resulted in only this particular item's description in the feed being empty? – mmackh Dec 10 '11 at 10:19
  • The error message about decoding the byte stream might be a bug in the validator. Note if you paste the feed document into the "Validate by direct input" form, no invalid character error is generated. Alternatively, use `htmlspecialchars` to replace the smart quotes with named character entities (`ENT_NOQUOTES` only applies to plain single and double quotes). However, some feed readers are reported to have problems with named character entities; it might be better to ignore the validator error message. – outis Dec 10 '11 at 10:20
  • RE: "warning: expects parameter 2 to be long": make sure you're entering exactly `ENT_NOQUOTES`, unquoted. Anything else (quotes, typo) and it will be passed as a string. – outis Dec 10 '11 at 10:23
  • I've written this for my iPad app, which has the strictest of xml parsers. That validator warning I'm receiving is indicating something is wrong, which in turn instantly crashes the whole app - all the other RSS feeds work great, just not mine. – mmackh Dec 10 '11 at 10:25
  • Now we get to the real issue. This question suffers from the [XY problem](http://meta.stackexchange.com/questions/66377/). What you should do is create a [minimal test case](http://sscce.org/). Forget the code that generates the feed and create the smallest possible, static RSS file that causes your app to crash. Then create the smallest possible app that crashes on the static RSS feed. With that focused example, you might see the actual reason why it's crashing. If not, create a new question asking why the parser crashes on your feed, using your minimal sample. – outis Dec 10 '11 at 10:53
  • I've written the content of $output into a .xml file here: http://api.inoads.com/snowstorm/feed.xml – mmackh Dec 10 '11 at 11:17
1

There is one problem here:

$output = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>
...

There is a string containing "?>". This is the closing tag for PHP, so it will give you an error.

You can avoid these problems this way:

$output = "<?xml version=\"1.0\" encoding=\"UTF-8\"?".">
...
jap1968
  • The PHP parser is perfectly capable of handling PHP close tags embedded in strings, whether single-quoted, double-quoted, nowdoc or heredoc. – outis Dec 10 '11 at 11:12
0

The problem, I assume, is that you are storing this string with its quotes in the database. If so, PHP is removing the quotes (which is proper) to avoid bugs such as SQL injection. So remove the quotes in the DB and add them back while generating the XML file; that is the simplest approach in my opinion. Also try to avoid double quotes (") and use single quotes (') instead, because inside double quotes the PHP parser additionally checks the contents. Removing the quotes from the DB and adding them while generating the XML should help.

bkowalikpl
  • No, the replacement of `"Dave` with `&quot;Dave` is just a matter of what way escaping is done for the XML, and the two of them are equivalent. The replacement of `Dave?”` with `Dave??` is more likely an encoding matter, and since `”` isn't treated specially by either PHP or SQL, not a matter of any injection avoidance. – Jon Hanna Dec 10 '11 at 09:47
0

Another error you have is the format of the date. The date must be in RFC 822 format, like "Wed, 02 Oct 2002 08:00:00 EST", not "July/August 2008".
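
A minimal sketch of producing such a date in PHP, assuming you have (or can derive) a Unix timestamp for each item. The helper name `rssDate` is hypothetical, and it emits a numeric UTC offset rather than a zone abbreviation like EST; a stored value such as "July/August 2008" would first have to be mapped to a real date by hand.

```php
<?php
// Hypothetical helper: formats a Unix timestamp for an RSS <pubDate>.
// DATE_RSS is PHP's built-in constant for "D, d M Y H:i:s O".
function rssDate(int $timestamp): string
{
    return gmdate(DATE_RSS, $timestamp); // always rendered in UTC
}

echo rssDate(gmmktime(8, 0, 0, 10, 2, 2002)), "\n";
// Wed, 02 Oct 2002 08:00:00 +0000
```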

josegil