This has been an ongoing issue for over a year. I thought I had corrected it, but it has evolved into a monster.

I move large amounts of data, mainly text, between sites using XML generated on PHP systems. I ran into some basic XML special characters that broke the transfer, so I now run this code on all XML values:

$value=str_replace("'","&#039;",$value);
print '<'.$key.'>';
print htmlspecialchars($value, ENT_XML1 | ENT_QUOTES, 'UTF-8');
print '</'.$key.'>'; 

$key is the field name. This works perfectly for all data except anything containing an accented character, such as piñata. A value containing the ñ character comes out completely empty.

I have yet to locate a PHP function that cleans text for XML. I currently dump data from a database into this format, then load it into SimpleXML on the receiving side to write it back into a database.

A solution that either cleans all the data or uses JSON encoding instead of XML would be fantastic.

Thanks-Chris

Radium Chris
  • If it's server to server, why not base64url-encode the keys/values? Personally, I would use RPC for this kind of thing. – Lawrence Cherone Jan 13 '18 at 03:21
  • Base64 encode/decode shows odd results: `$value = 'ñ'; print $value; $value = base64_encode($value); print base64_decode($value);` The first two statements by themselves print the correct ñ; after encoding and decoding I get 2 characters: сс – Radium Chris Jan 13 '18 at 15:34
  • Notice that I wrote base64**url** encode above. If you're using key like it's not going to work; base64url encoding will prevent non-safe chars. https://3v4l.org/LehjA. In my test, though, I could not turn `ñ` into `cc`, so I'm not sure what's happening there. – Lawrence Cherone Jan 13 '18 at 16:04
  • I believe my issue with base64 was the same root cause as my first issue. The encoding going in is not UTF-8. I did not try to encode then change to base64. – Radium Chris Jan 13 '18 at 18:39

2 Answers

In my case, even though all my tables are set to UTF-8, I have to convert the values to UTF-8 when constructing the XML:

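// utf8_encode() assumes the input is ISO-8859-1. It is deprecated as of PHP 8.2;
// mb_convert_encoding($value, 'UTF-8', 'ISO-8859-1') is the equivalent call.
$value = utf8_encode($value);
print '<'.$key.'>';
// ENT_QUOTES already escapes apostrophes, so no separate str_replace() is needed;
// replacing "'" with "&#039;" first would get its "&" escaped a second time.
print htmlspecialchars($value, ENT_XML1 | ENT_QUOTES, 'UTF-8');
print '</'.$key.'>';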

I am not sure where the encoding changes between reading from the table and writing the XML, but this has produced the results I required. I do not think Base64 is viable for values with special characters.
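A likely explanation for the empty values, shown as a minimal sketch: htmlspecialchars() returns an empty string when its input is not valid in the given charset (unless ENT_SUBSTITUTE or ENT_IGNORE is set), so a single Latin-1 ñ byte wipes out the whole value. The helper name xml_text and the ISO-8859-1 source encoding are assumptions for illustration, not something confirmed by the setup above. The check before converting avoids double-encoding values that are already UTF-8.

```php
<?php
// Hypothetical helper: escape a value for XML text content.
// Assumption: anything that is not valid UTF-8 is ISO-8859-1 (Latin-1).
function xml_text(string $value, string $assumedSource = 'ISO-8859-1'): string
{
    // Convert only when needed, so UTF-8 input is not encoded twice.
    if (!mb_check_encoding($value, 'UTF-8')) {
        $value = mb_convert_encoding($value, 'UTF-8', $assumedSource);
    }
    return htmlspecialchars($value, ENT_XML1 | ENT_QUOTES, 'UTF-8');
}

$latin1 = "pi\xF1ata";      // "piñata" as ISO-8859-1 bytes (invalid as UTF-8)
$utf8   = "pi\xC3\xB1ata";  // "piñata" as UTF-8 bytes

// Invalid UTF-8 input makes htmlspecialchars() return an empty string.
$raw = htmlspecialchars($latin1, ENT_XML1 | ENT_QUOTES, 'UTF-8');

// The helper yields the same UTF-8 text for both inputs.
$fromLatin1 = xml_text($latin1);
$fromUtf8   = xml_text($utf8);
```

If the data really is stored as UTF-8, the conversion is probably happening on the database connection, so setting the connection charset (for example with mysqli's set_charset) may remove the need for any conversion here.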

Radium Chris

If you use an XML API (DOM, XMLWriter), it will take care of escaping values/text content for you, provided the strings you pass in are UTF-8. Tag names, however, are a different issue. You will have to either create a normalized tag name or use a fixed tag name and store the original field name as an attribute value.

For example with a fixed tag name field:

<records>
  <record>
    <field name="some field">some content</field>
  </record>
</records>

This is the cleaner variant: because there are no dynamic tag names, you can create a Schema/DTD and validate the XML.
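As a rough sketch of the fixed-tag-name variant with DOM: the $rows array below is made-up sample data standing in for database rows, and the element names follow the example above. DOM escapes the text and attribute values itself, but it expects the strings to be UTF-8.

```php
<?php
// Hypothetical sample data; in practice these rows would come from the database.
$rows = [
    ['some field' => "pi\u{F1}ata & friends"],  // "\u{F1}" is ñ, written as UTF-8
];

$document = new DOMDocument('1.0', 'UTF-8');
$records = $document->appendChild($document->createElement('records'));

foreach ($rows as $row) {
    $record = $records->appendChild($document->createElement('record'));
    foreach ($row as $name => $value) {
        // Fixed tag name; the original field name goes into an attribute.
        $field = $record->appendChild($document->createElement('field'));
        $field->setAttribute('name', $name);
        // createTextNode() escapes &, < and > when the document is serialized.
        $field->appendChild($document->createTextNode($value));
    }
}

$xml = $document->saveXML();

// Receiving side: SimpleXML returns the original strings.
$loaded = simplexml_load_string($xml);
$firstField = $loaded->record->field;
```

On the receiving side, (string) $firstField gives back the value and (string) $firstField['name'] gives back the field name.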

Or a normalized version of the field name:

<records>
  <record>
    <some-field>some content</some-field>
  </record>
</records>

This is often used as a generic way to serialize a data structure as XML. The result is only well-formed XML; you cannot define a Schema/XSD for it because the tag names depend on the data.
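One possible way to normalize field names, as a sketch only: the function name normalize_tag_name and the "field" prefix are hypothetical choices, and the rules are deliberately stricter than what XML allows (lowercase ASCII letters, digits, underscores and hyphens). Different field names can collapse to the same tag name, so this only suits data where that cannot happen.

```php
<?php
// Hypothetical helper: turn an arbitrary field name into a safe XML tag name.
function normalize_tag_name(string $fieldName): string
{
    $name = strtolower(trim($fieldName));
    // Collapse every run of other characters into a single hyphen.
    $name = preg_replace('/[^a-z0-9_]+/', '-', $name);
    $name = trim($name, '-');

    if ($name === '') {
        return 'field';
    }
    // XML names must not start with a digit or a hyphen.
    if (!preg_match('/^[a-z_]/', $name)) {
        $name = 'field-' . $name;
    }
    return $name;
}

$tag = normalize_tag_name('some field');
$numeric = normalize_tag_name('2nd Address');
```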

ThW