3

I'm allowing some user input on my website that is later read as XML. Every once in a while I get these weird single or double quotes, like this: ”’. These are copied directly from the source that broke my XML. I'm wondering if there is an easy way to correct these types of characters in my XML. htmlentities did not seem to touch them.

Where do these characters come from? I'm not even sure how I'd go about typing them out unintentionally.

EDIT- I forgot to clarify these quotes are not being used in attributes, but in the following way:

<SomeTag>User’s Input</SomeTag>
Robert Harvey
mouser58907
  • couldn't you just use a simple string replace? – annonymously Feb 03 '12 at 01:33
  • How do these "break" your XML? How are you outputting them? If those are a problem, then any non-ASCII characters probably are. – deceze Feb 03 '12 at 01:33
  • Well, basically it failed to parse on both the iphone and android. I'm just worried there are more characters that could break it. Otherwise a simple replace would suffice. – mouser58907 Feb 03 '12 at 01:40
  • "Failed to parse" in what way? What are the error messages? I guess you simply have an encoding issue, like specifying that your XML file is encoded in UTF-8, but you're actually outputting these characters as latin1 encoded. That's a general encoding issue you need to solve, it's not specific to these characters. – deceze Feb 03 '12 at 02:03
  • These characters cause problems when parsing XML into Flash too, if you haven't remembered to embed all of the edge case characters, they don't turn up on screen :/ – danjah Feb 03 '12 at 02:55

5 Answers

2

Don't disallow and/or modify foreign characters; that's just annoying for your users! This is just an encoding issue. I don't know what parser you're using to read the XML, but if it's reasonably sophisticated, you can solve your problem by including the following encoding pragma at the top of your XML files:

<?xml version="1.0" encoding="UTF-8"?>

There may also be a UTF-8 option in the parser's API.
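If the XML is generated from PHP, one way to get a consistent declaration and encoding is to let a DOM serializer write the document. The following is only a minimal sketch, assuming the dom extension is available and that the user's text is already valid UTF-8; the helper name buildSomeTagXml is made up for illustration and is not part of the question.

```php
<?php
// Hypothetical helper (not from the original post): wraps user text in
// <SomeTag> and serializes the document with a UTF-8 XML declaration.
// Assumes $userText is already valid UTF-8.
function buildSomeTagXml(string $userText): string
{
    $doc = new DOMDocument('1.0', 'UTF-8');
    $element = $doc->createElement('SomeTag');
    // createTextNode escapes markup characters such as & and < for us;
    // typographic quotes are ordinary text and need no special handling.
    $element->appendChild($doc->createTextNode($userText));
    $doc->appendChild($element);
    return $doc->saveXML();
}

echo buildSomeTagXml('User’s Input');
```

Because the serializer writes both the declaration and the bytes, the declared encoding and the actual encoding cannot drift apart, which is the mismatch suspected in the comments.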

Edit: I just read that you're reading the XML directly in a browser. Most browsers listen to the encoding pragma!

Edit 2: Apparently, those quotes aren't even legal in UTF-8, so ignore what I said above. Instead, you might find what you're looking for here, where a similar problem is being discussed.

Milosz
  • 1
    But what if your editor changed encoding="UTF-8" to Encoding=“UTF-8” – James Anderson Feb 03 '12 at 08:33
  • They're perfectly legal in UTF-8. If they weren't, we wouldn't be able to use them. It may be that they're being used where `"` or `'` is mandatory for attribute value delimiting - the original question isn't clear on this - but otherwise I think you’re correct here. – Jon Hanna Feb 03 '12 at 12:13
  • They are valid characters in UTF-8. But they are not valid characters for enclosing an XML attribute value. So they can quite happily appear as part of the contents of an element or attribute, but they cannot be used to delimit the value of an attribute. For an attribute definition like encoding="utf-8" the quotes must be dumb. – James Anderson Feb 06 '12 at 01:45
  • @JamesAnderson I edited my original post to point out that they are not being used to enclose attributes. If they are valid utf-8 characters, I should be able to correct my encoding. I'll check into it and see if that works. – mouser58907 Feb 06 '12 at 19:10
2

Are these quotes being used in text content, or to delimit attributes? For attribute delimiters, XML requires typewriter quotes (single or double). Microsoft and other word-processing applications often try to be smart and replace typewriter quotes with typographical quotes, which is almost certainly the answer to the question "where are they coming from?".

If you need to get rid of them, a simple global replace using a text editor will do the job fine.

But you might first try to work out why they are causing a problem. Perhaps your data flow can't handle ANY non-ASCII characters, in which case that's a deeper problem that you really ought to fix (it would typically imply that some unwanted transcoding is happening somewhere along the line).
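One way to check for this kind of transcoding problem is to test whether the incoming bytes are valid UTF-8 before they are written into the XML. The sketch below assumes the mbstring extension is available and that anything which is not valid UTF-8 arrived as Windows-1252 (the encoding Word-style quotes usually come in); that is a guess about the data, not something the question confirms, and toUtf8 is a hypothetical helper name.

```php
<?php
// Hypothetical helper (not from the original post): returns $input as UTF-8.
// ASSUMPTION: input that is not valid UTF-8 is treated as Windows-1252.
function toUtf8(string $input): string
{
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input; // already valid UTF-8, leave it untouched
    }
    return mb_convert_encoding($input, 'UTF-8', 'Windows-1252');
}

// 0x92 is the right single quotation mark in Windows-1252.
$fromForm = "User\x92s Input";
echo toUtf8($fromForm);
```

Text that is already UTF-8 passes through unchanged, so the function can be applied to every submission without double-encoding it.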

Michael Kay
1

Stay away from Microsoft Office apps. Word, Excel, etc. have a nasty habit of replacing matching pairs of single quotes and double quotes with non-standard "smart quotes".

These quote characters are truly non-standard and never made it into the official latin-1 character set. All the MS Office apps "helpfully" replace standard quote characters with these abominations.

Just google for "undoing smartquotes" or "convert smartquotes back" for hints, tips, and regexes to get rid of these.

James Anderson
  • So that is where these come from? As mentioned by Milosz, I'd hate to modify user's input but I don't see many options except to replace them. – mouser58907 Feb 03 '12 at 01:52
  • How do they differ from the standard `“‘’”`? Surely if they were PUA they'd not look like quotes in another context. – Jon Hanna Feb 03 '12 at 02:24
  • 2
    That's not really an answer, is it? These smart quotes are very well part of Unicode (U+201D and U+2019), otherwise they couldn't even be displayed on this page. And as normal Unicode characters, they can very well be embedded into XML documents. While yes, you often *want* to replace smart quotes with regular quotes for various reasons, there's no technical reason to outright "avoid MicroSoft Office" because of them. – deceze Feb 03 '12 at 04:34
  • It's the "Latin-1" eight-bit encoding which is non-standard and the cause of many crashes. To state the problem again: various MS tools blindly replace dumb quotes with smart quotes. However, the XML standard specifies that attribute values be enclosed in dumb quotes, so using the fancy quote characters to enclose attribute strings is not valid XML. – James Anderson Feb 03 '12 at 08:24
  • Sure, absolutely. But we don't even know the problem is caused by these quotes being used in XML attributes! It sounds more like he's wrapping content in XML, and these quotes are part of the content (which should work just fine), not part of the XML. So to me it sounds more like an encoding problem, most likely Latin1 embedded in what should be a UTF-8 encoded document. Unfortunately the OP hasn't specified that, despite my asking. – deceze Feb 03 '12 at 10:31
  • Also, the type of quotes does not necessarily have anything to do with the encoding. These fancy quotes can be encoded in both Latin1 and UTF-8, and probably a few other encodings as well. – deceze Feb 03 '12 at 11:52
  • @deceze They aren't in "Latin 1" meaning ISO 8859-1, they are in "Latin 1" meaning Windows-1252, of which the former is defined by a standard and the latter by Microsoft. However, the standards on defining which encoding is used in a given case allow for either to be used (on the other hand, for neither to necessarily be supported). Granted this has caused problems, but you are of course correct in that they are both standard characters (and indeed those recommended for use with the English language; `"` and `'` are for code and legacy systems that don't fix user input like Word does). – Jon Hanna Feb 03 '12 at 15:04
  • @Jon Just to clarify, because I seem to have missed where this information came into the discussion: how did anyone determine what the quotes the OP is asking about are *encoded* in to begin with? He's just talking about the "characters breaking his XML", which could really mean anything without clarification, which was never given. – deceze Feb 04 '12 at 03:37
  • @deceze Well, they're encoded at some point when they come from the user to the website. I agree though that we don't know if encoding is anything to do with the problem. Especially since we don't know if they're being used as attribute delimiters which would be incorrect whatever the encoding. – Jon Hanna Feb 04 '12 at 09:49
1

If the input string is UTF-8 encoded, maybe you need to specify that to htmlentities(), for example:

$html = htmlentities( '”’', ENT_COMPAT, "utf-8" );
echo $html;

For me, this gives:

&rdquo;&rsquo;

whereas

$html = htmlentities( '”’' );
echo $html;

gets confused:

&acirc;??&acirc;??

If the input string is non-UTF-8, then you'd need to adjust the encoding arg for htmlentities() accordingly.
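One caveat when the target is XML rather than HTML: named entities such as &rdquo; are not predefined in XML, so a parser will reject them unless a DTD declares them. A possible alternative is to escape only the XML markup characters and turn everything outside ASCII into numeric character references. This is a rough sketch, assuming the mbstring extension, UTF-8 input, and PHP 5.4 or later for ENT_XML1; xmlSafeText is a made-up helper name.

```php
<?php
// Hypothetical helper (not from the original post): makes UTF-8 text safe
// to place inside an XML element using only ASCII in the output.
function xmlSafeText(string $utf8Text): string
{
    // Step 1: escape the XML markup characters (&, <, >, quotes).
    $escaped = htmlspecialchars($utf8Text, ENT_QUOTES | ENT_XML1, 'UTF-8');
    // Step 2: convert every code point from U+0080 upward to a decimal
    // numeric character reference. Map format: [start, end, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($escaped, $map, 'UTF-8');
}

echo '<SomeTag>' . xmlSafeText('User’s Input') . '</SomeTag>';
```

The order matters: escaping first and converting second keeps the ampersands of the generated references from being escaped again.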

martin clayton
0

Use

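    $s = 'User’s Input';
    // The /u modifier makes PCRE treat pattern and subject as UTF-8, so each
    // curly quote is matched as one character instead of byte by byte.
    $descriptfix = preg_replace('/[“”]/u', '"', $s);
    $descriptfix = preg_replace('/[‘’]/u', "'", $descriptfix);
    // Escape the cleaned string (not $s) and concatenate it into the tag.
    echo '<SomeTag>' . htmlspecialchars($descriptfix, ENT_QUOTES | ENT_XML1, 'UTF-8') . '</SomeTag>';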
ConRockets