6

Is there a function I can use to process any string so that it won't cause XML parsing problems? I have a PHP script outputting an XML file with content obtained from forms.

The thing is, apart from the usual string checks on a PHP form, some of the user text causes XML parsing errors. I'm facing this "’" in particular. This is the error I'm getting: Entity 'rsquo' not defined

Does anyone have any experience in encoding text for xml output?

Thank you!


Some clarification: I'm outputting content from forms into an XML file, which is subsequently parsed by JavaScript.

I process all form inputs with: htmlentities(trim($_POST['content']), ENT_QUOTES, 'UTF-8');

When I want to output this content into a xml file, how should I encode it such that it won't throw up xml parsing errors?

So far the following 2 solutions work:

1) echo '<content><![CDATA['.$content.']]></content>';

2) echo '<content>'.htmlspecialchars(html_entity_decode($content, ENT_QUOTES, 'UTF-8'),ENT_QUOTES, 'UTF-8').'</content>'."\n";

Are the above 2 solutions safe? Which is better?

Thanks, sorry for not providing this information earlier.

Lyon
  • I would use an XML parser to see if an XML parser doesn’t choke on the input. – Gumbo Jun 29 '10 at 16:11
  • The problem here is that XML does only know few entities that actually specify character references. (See http://www.w3.org/TR/xml/#sec-predefined-ent) – Gumbo Jun 29 '10 at 16:17

8 Answers

7

You take it the wrong way - don't look for a parser which doesn't give you errors. Instead try to have a well-formed xml.

How did you get &rsquo; from the user? If they literally typed it in, you are not processing the input correctly; for example, you should escape & to &amp;. If it is you who put the entity there (perhaps in place of an apostrophe), either define it in a DTD (<!ENTITY rsquo "&#x2019;">) or write it using numeric notation (&#x2019;), because almost all of the named entities belong to HTML. XML defines only a few basic ones, as Gumbo pointed out.
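To make the difference concrete, here is a minimal sketch that feeds three variants of the same text to PHP's DOMDocument and checks whether each one parses. The helper name parses_as_xml() and the sample strings are made up for illustration; only DOMDocument::loadXML() and the libxml error functions are real API.

```php
<?php
// Collect libxml parse errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Hypothetical helper: true if the string parses as well-formed XML.
function parses_as_xml(string $xml): bool
{
    $doc = new DOMDocument();
    $ok = $doc->loadXML($xml);
    libxml_clear_errors();
    return $ok;
}

// Named HTML entity with no declaration: not defined in plain XML.
$named = '<content>It&rsquo;s</content>';

// Numeric character reference: needs no declaration.
$numeric = '<content>It&#x2019;s</content>';

// Same named entity, but declared in an internal DTD subset.
$withDtd = '<!DOCTYPE content [<!ENTITY rsquo "&#x2019;">]>'
         . '<content>It&rsquo;s</content>';

var_dump(parses_as_xml($named));
var_dump(parses_as_xml($numeric));
var_dump(parses_as_xml($withDtd));
```

Only the first variant should be rejected, with the same "Entity 'rsquo' not defined" complaint as in the question.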

EDIT based on additions to the question:

  • In #1, you escape the content in such a way that if a user types in ]]> <°)))><, you have a problem: the ]]> ends the CDATA section early.
  • In #2, you are encoding and then decoding, which results in the original value of the content. The decoding should not be necessary (unless you expect users to post values like &amp; which should be interpreted as &).

If you use htmlspecialchars() with ENT_QUOTES, it should be ok, but see how Drupal does it.
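As a rough sketch of both options on raw (not yet entity-encoded) input: the cdata_wrap() helper below is hypothetical, not a PHP built-in, and uses the common trick of splitting every ]]> across two CDATA sections. The second line is the plain htmlspecialchars() route. The sample string is made up, and the file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: wrap text in a CDATA section, splitting any "]]>"
// across two sections so that user input cannot close the section early.
function cdata_wrap(string $text): string
{
    return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $text) . ']]>';
}

// Raw user input, assumed not to have been passed through htmlentities().
$content = 'fish ]]> <°)))>< & chips';

// Option 1: CDATA with the terminator neutralised.
$viaCdata = '<content>' . cdata_wrap($content) . '</content>';

// Option 2: escape the XML special characters directly.
$viaEscape = '<content>' . htmlspecialchars($content, ENT_QUOTES, 'UTF-8') . '</content>';

echo $viaCdata, "\n";
echo $viaEscape, "\n";
```

Either string should parse as XML and give back the original text; the escaping route is shorter and avoids the special case entirely.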

Krab
  • thanks Krab. What i do with user inputs is put them through this: `htmlentities($_POST['content'], ENT_QUOTES, 'UTF-8');`. sleepynate's suggestion to use html_entity_decode fixed `’` as it converted it back..but then i had problems with `&`. What should I do? Is this `htmlspecialchars(html_entity_decode($content, ENT_QUOTES, 'UTF-8'),ENT_QUOTES, 'UTF-8')` sufficient to ensure future user inputs won't cause problems with my xml file? I need the xml file to be error-free since a javascript function is parsing it. – Lyon Jun 29 '10 at 16:32
  • Is there any reason you must use htmlentities() and not htmlspecialchars()? – Krab Jun 29 '10 at 16:43
  • no particular reason actually..would htmlspecialchars() suffice to process all user inputs? when would one use htmlentities() then? – Lyon Jun 29 '10 at 16:48
  • so actually, if i'm already processing, saving and outputting inputs all in utf-8, i wouldn't need htmlentities, and htmlspecialchars would suffice? Thanks! – Lyon Jun 29 '10 at 16:53
  • Maybe you could use htmlentities() to produce a document containing unicode characters while preserving plain ASCII encoding. But this is only true if it also encodes characters to numeric entities. I don't know. – Krab Jun 29 '10 at 16:55
  • thanks Krab. really appreciate your help in this. I've decided to change htmlentities to htmlspecialchars thus negating the need to decode/encode any content before outputting to xml. thanks! :) – Lyon Jun 29 '10 at 17:11
  • Using `CDATA` results in _well-formed_ XML. There's no reason to munge HTML character entities in attempt to bend your content to XML. Hey, anyone remember how well XHTML worked out for us? – vhs Feb 20 '19 at 05:44
5
html_entity_decode($string, ENT_QUOTES, 'UTF-8')
da5id
sleepynate
  • that resolves the `’` error but brings up `&` errors now? If i change `&` to `&amp;` it fixed the error but how can I decode everything properly? – Lyon Jun 29 '10 at 16:22
  • yep. i'm outputting in utf-8. my xml output starts with `echo ''."\n";` thanks – Lyon Jun 29 '10 at 16:42
4

Enclose the value within CDATA tags.

<message><![CDATA[&rsquo;]]></message>

From the w3schools site:

Characters like "<" and "&" are illegal in XML elements.

"<" will generate an error because the parser interprets it as the start of a new element.

"&" will generate an error because the parser interprets it as the start of a character entity.

Some text, like JavaScript code, contains a lot of "<" or "&" characters. To avoid errors script code can be defined as CDATA.

Everything inside a CDATA section is ignored by the parser.

Joseph
  • this is a good solution, though I shudder at unescaped CDATA being used somewhere else in the code further down the line (for example, to be put in to a database), though that's not directly related to the problem at hand. – sleepynate Jun 29 '10 at 16:47
  • This is the correct answer where RSS is concerned unless you're using XHTML DTD. Especially useful when outputting feeds using [`content:encoded`](http://purl.org/rss/1.0/modules/content/) in a [`CDATASection`](https://devdocs.io/dom/cdatasection). – vhs Feb 20 '19 at 05:47
3

The problem is that your htmlentities function is doing what it should - generating HTML entities from characters. You're then inserting these into an XML document which doesn't have the HTML entities defined (things like &rsquo; are HTML-specific).
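A small sketch of that difference, using a made-up input string (and assuming the file is saved as UTF-8): htmlentities() turns the curly apostrophe into the HTML-only &rsquo;, while htmlspecialchars() leaves it as a literal character and only escapes the characters that are special in markup.

```php
<?php
// Hypothetical user input containing a right single quotation mark (U+2019).
$input = "Lyon’s \"A & B\" <test>";

// htmlentities() converts every character that has a named HTML entity.
$asHtmlEntities = htmlentities($input, ENT_QUOTES, 'UTF-8');

// htmlspecialchars() only converts &, <, > and the quote characters.
$asSpecialChars = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

echo $asHtmlEntities, "\n";
echo $asSpecialChars, "\n";
```

The first result contains &rsquo;, which an XML parser rejects unless a DTD declares it; the second keeps the ’ character itself, which is fine in a UTF-8 XML document.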

The easiest way to handle this is keep all input raw (i.e. don't parse with htmlentities), then generate your XML using PHP's XML functions.

This will ensure that all text is properly encoded, and your XML is well-formed.

Example:

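$user_input = "...<>&'";

$doc = new DOMDocument('1.0', 'utf-8');

// createTextNode() takes the raw text; <, > and & are escaped on serialization
$element = $doc->createElement("content");
$element->appendChild($doc->createTextNode($user_input));

$doc->appendChild($element);

// Output the document, including the XML declaration
echo $doc->saveXML();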
porges
  • thanks Porges! Currently i'm just echo-ing out into a xml file. I'll use PHP's XML functions to properly produce a xml document. :) – Lyon Jul 01 '10 at 15:14
1

I had a similar problem: the data I needed to add to the XML was already being returned by my code with htmlentities() applied (it is not stored like this in the database).

I used:

$doc = new DOMDocument('1.0','utf-8');    
$element = $doc->createElement("content");    
$element->appendChild($doc->createElement('string', htmlspecialchars(html_entity_decode($string, ENT_QUOTES, 'UTF-8'), ENT_XML1, 'UTF-8')));
$doc->appendChild($element);

Or, if it has not already been through htmlentities(), just the below should work:

$doc = new DOMDocument('1.0','utf-8');

$element = $doc->createElement("content");       
$element->appendChild($doc->createElement('string', htmlspecialchars($string, ENT_XML1, 'UTF-8')));
$doc->appendChild($element);

Basically, using htmlspecialchars with ENT_XML1 should turn user-inputted data into XML-safe data (and works fine for me):

htmlspecialchars($string, ENT_XML1, 'UTF-8');
Ford
0

This worked for me. Anyone facing the same issue can try this.

htmlentities($string, ENT_XML1)

With special characters conversion.

htmlspecialchars(htmlentities($string, ENT_XML1))

Mukesh Joshi
0

Using htmlspecialchars() will solve your problem. See the post below.

PHP - Is htmlentities() sufficient for creating xml-safe values?

Community
Tahir Yasin
-1
htmlspecialchars(trim($_POST['content']), ENT_XML1, 'UTF-8');

Should do it.

tfont