I'm using simplexml_load_string to load an XML document into an object. This seemed to be working great up until I came across this element:

<some_string_val>1.&#160;&#160;&#160;&#160; Some text.</some_string_val>

After running that through simplexml_load_string, what came out was:

["some_string_val"]=> string(20) "1.     Some text"

I tried using:

html_entity_decode($string,  ENT_QUOTES, "Windows-1252");

And that seemed to convert the &#160;'s to plain text, but when I tried to run that through simplexml_load_string I get the same result. I also tried with UTF-8, and a few others, with similar or worse results.

So, what can I do to convert the &#160;'s to UTF-8 so it can be parsed correctly by simplexml_load_string? Keeping the HTML entities intact is not a concern because this is going into a CSV.

EDIT: This has been unjustly marked as a duplicate for a couple of reasons:

  1. This is not language agnostic; this is dealing with a specific set of PHP functions, unlike the post which this was marked a duplicate of
  2. This is not going to an HTML page or a PDF, it is going to a CSV, so I cannot set a header. The accepted solution will not work in my case
Samsquanch
  • `["some_string_val"]=> string(20) "1.    Some text"` is not output you see in your browser? I doubt that. I also verified it's a duplicate. For CSV files you might have to take a look in the manual of the software you open it with for *how to import a .csv file that uses UTF-8 character encoding*. The CSV file itself works very well with UTF-8 from PHP. – hakre Oct 11 '14 at 23:56

2 Answers

I think it parses correctly. It's just the way that function works: it replaces those numeric character references with the actual characters.

You can fix the result string by converting it to cp1251:

$str = iconv('utf-8', 'cp1251', $str);

I would also collapse repeated spaces before writing it to the CSV file:

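// After the cp1251 conversion, a no-break space is the single byte 0xA0.
$str = str_replace(chr(160), ' ', $str);
// Collapse runs of whitespace into one space and trim both ends.
$str = trim(preg_replace('/\s+/', ' ', $str));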
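
As an alternative sketch that skips the encoding conversion: assuming the value read back from SimpleXML is UTF-8, the no-break space can be matched directly by its UTF-8 byte pair `\xC2\xA0`. The helper name `normalize_spaces` is made up for illustration and is not part of the original answer.

```php
<?php
// Hypothetical helper: replace UTF-8 no-break spaces (U+00A0, bytes C2 A0)
// with ordinary spaces, then collapse whitespace runs and trim the ends.
function normalize_spaces(string $value): string
{
    $value = str_replace("\xC2\xA0", ' ', $value);
    return trim(preg_replace('/\s+/', ' ', $value));
}

$xml = simplexml_load_string(
    '<some_string_val>1.&#160;&#160;&#160;&#160; Some text.</some_string_val>'
);

// Casting the element to string yields its text content.
$clean = normalize_spaces((string) $xml);
echo $clean, "\n";
```

This runs after parsing, on the string taken from the XML object, so the XML source itself does not have to be edited.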
Tengiz
  • Would I need to do this before it feeds into `simplexml_load_string`? I tried after it had already gone through and didn't seem to do anything. I also did attempt to do it before, but I may have done it wrong. – Samsquanch Sep 19 '14 at 15:56
  • @Samsquanch after, when you work with that XML object. Or a better way would be to replace all special characters (&#160;) with their equivalent before parsing the XML. In that case you might not need to convert the encoding and you'll save some time – Tengiz Sep 19 '14 at 17:34
  • just run this before parsing that string: $str = preg_replace('/&#160;/', ' ', $str); – Tengiz Sep 19 '14 at 18:09
  • Although I would have rather not done it this way, the `preg_replace` did end up working. Thanks. – Samsquanch Sep 19 '14 at 21:07

SimpleXML itself has no problem parsing the XML properly:

$string = '<some_string_val>1.&#160;&#160;&#160;&#160; Some text.</some_string_val>';
$xml = simplexml_load_string($string);
echo $xml;

Output:

1.     Some text.

What happens is that after you have read out that UTF-8 string (each no-break space is the byte pair C2 A0), you send it somewhere and tell that somewhere, explicitly or by omission, that it is not UTF-8 but some other encoding. Most likely Latin-1; I have to guess, because you didn't share this kind of information in your question.

That somewhere will then display the binary sequence C2 A0 as two characters:

  1. C2 Â
  2. A0 " " (No Break Space)
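
To make the byte-level picture concrete, here is a minimal sketch that dumps the bytes SimpleXML returns and then simulates a consumer that reads them as Latin-1. It assumes the mbstring extension is available. The re-encoding step is only there to reproduce the misreading; it is not something you would do in real code.

```php
<?php
$xml = simplexml_load_string(
    '<some_string_val>1.&#160;&#160;&#160;&#160; Some text.</some_string_val>'
);
$value = (string) $xml;

// Hex dump of the bytes SimpleXML returned: each &#160; became "c2a0".
$hex = bin2hex($value);
echo $hex, "\n";

// Simulate a consumer that wrongly treats the bytes as Latin-1 by
// re-encoding them from ISO-8859-1 to UTF-8 for display.
// Every C2 byte now shows up as "Â", followed by a no-break space.
$misread = mb_convert_encoding($value, 'UTF-8', 'ISO-8859-1');
echo $misread, "\n";
```

The parsed string is 21 bytes long: each of the four no-break spaces takes two bytes in UTF-8.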

For example: you need to write into the CSV file. You can just write the data into it UTF-8 encoded. When you open the CSV file in your spreadsheet application, it should ask you about the encoding. Tell it to use Unicode (UTF-8). Then everything is fine.
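
A minimal sketch of writing the value as UTF-8, under a few assumptions: `php://temp` stands in for a real file path so the example is self-contained, and the leading byte order mark is an optional hint that some spreadsheet programs use to detect UTF-8. Whether your program honours it is something to check in its import documentation.

```php
<?php
$xml = simplexml_load_string(
    '<some_string_val>1.&#160;&#160;&#160;&#160; Some text.</some_string_val>'
);

// php://temp keeps the sketch self-contained; a real script would open a file path.
$fh = fopen('php://temp', 'w+');

// Optional UTF-8 byte order mark (assumption: the importing program uses it
// to detect the encoding; some programs ignore it or need a manual setting).
fwrite($fh, "\xEF\xBB\xBF");

// Write the value unchanged, still UTF-8 encoded.
fputcsv($fh, [(string) $xml], ',', '"', '\\');

// Read the result back only to inspect what was written.
rewind($fh);
$csv = stream_get_contents($fh);
fclose($fh);
```

The no-break spaces stay in the file as the bytes C2 A0, so the importing program still has to read the file as UTF-8.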

If you display the array in your browser (this is how I read your question), then tell your browser that the website is in UTF-8. You should find an encoding setting in your web browser's menu to do that.

hakre