
I read data from the Twitter Streaming API and then write it to an XML file.

But some special characters, such as &#55357;, cause an error (when I open the XML file in Chrome, Chrome reports an error at that character).

I want to convert that encoded sequence (&#55357;) into the real character before writing to the XML file.

How can I implement this?

-------------ADDED--------------

This is the XML file content:

<?xml version="1.0" encoding="UTF-8"?>
<root>
<text>@carlyraejepsen would be a dream if you follow me, please follow me?, I love you so much you're my inspiration</text>
<text>someone please bring me a caramel apple and a mocha from black cat. i'll love you forever</text>
<text>“@G_MartinFlyKick: Marry me Juliet.I love you and that's all I really know.”&#55357;&#56834;&#55357;&#56834;&#55357;&#56834;&#55357;&#56834;&#55357;&#56834;</text>
<text>"I need to see a picture of him cuz Im trying to imagine you guys making love and all I see is u climbing on top of a big question mark"lmao</text>
<text>@District3music hi, I LOVE YOU follow me please? &amp;lt;3 xx 23</text>
<text>RT @syardley_: So appreciative of my family and people I love, wouldn't be where I am without them. #thankful</text>
<text>#DISTRICT3HALLOWEENFOLLOWSPREE #DISTRICT3HALLOWEENFOLLOWSPREE #3EEKERFROMTHENETHERLANDS love you! Please follow ? @District3music x42</text>
<text>Arguably my favorite electronic music producer @Kluteuk is coming back to Toronto on Dec 22nd. So stoked. Guy has made so many tunes I LOVE.</text>
<text>The stakes are high, the water's rough, but this love is ours.</text>
<text>@NiallOfficial Answer me, I love you very much. Venezuela loves. jhgj</text>
<text>Love this shit http://t.co/qSP79NKx</text>
</root>

And here is the error from Chrome:

This page contains the following errors:

error on line 5 at column 91: xmlParseCharRef: invalid xmlChar value 55357
Below is a rendering of the page up to the first error.
Songokute

2 Answers


The character reference &#55357; denotes a surrogate code point (U+D83D), so it would be wrong to try to convert it to a character. It is not a character, not even half a character.

You need to trace back to the point where the reference was generated. The cause might be character encoding confusion. In UTF-16, surrogate code units may appear, but they must be handled in pairs when the data is interpreted as characters and, e.g., converted to another encoding or turned into character references.
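
As an illustration of handling surrogates in pairs, here is a minimal Python sketch that merges each adjacent high/low surrogate reference into one reference to the real code point. The function name `combine_surrogate_refs` is hypothetical, and the sketch assumes decimal references only (hexadecimal forms are left untouched). It is a repair step for already-generated text; fixing the code that generates the references remains the better remedy.

```python
import re

# One or more adjacent decimal character references, e.g. "&#55357;&#56834;".
REF_RUN = re.compile(r'(?:&#\d+;)+')


def combine_surrogate_refs(text):
    """Merge high/low surrogate reference pairs into a single reference.

    Hypothetical helper: only decimal references are considered.
    """
    def fix_run(match):
        codes = [int(n) for n in re.findall(r'&#(\d+);', match.group(0))]
        out = []
        i = 0
        while i < len(codes):
            hi = codes[i]
            lo = codes[i + 1] if i + 1 < len(codes) else None
            is_pair = (0xD800 <= hi <= 0xDBFF
                       and lo is not None
                       and 0xDC00 <= lo <= 0xDFFF)
            if is_pair:
                # Standard UTF-16 decoding of a surrogate pair.
                out.append(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00))
                i += 2
            else:
                # Not part of a valid pair: keep the reference as it was.
                out.append(hi)
                i += 1
        return ''.join('&#%d;' % code for code in out)

    return REF_RUN.sub(fix_run, text)
```

With the sample from the question, `&#55357;&#56834;` becomes `&#128514;` (U+1F602), which is a legal XML character reference. An unpaired surrogate reference is left as it is, so it would still need to be fixed at the source.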

Jukka K. Korpela
  • I retrieve the data from this link: https://stream.twitter.com/1.1/statuses/filter.json?delimited=length&track=love. So how could encoding confusion occur? – Songokute Oct 31 '12 at 18:49
  • @Songokute, hard to tell, because the page prompts for a user name and password. – Jukka K. Korpela Oct 31 '12 at 18:59
  • Judging by the XML file content, it seems that the data contains characters like U+1F602 “😂”, which occupy two code units in UTF-16. Apparently the original data is UTF-16 and should first be converted to UTF-8. – Jukka K. Korpela Oct 31 '12 at 19:03
  • Sorry for the late reply! You can use your Twitter account to log in (this is the Twitter Streaming API). And do you mean that I should change the encoding of the XML file from UTF-8 to UTF-16? – Songokute Oct 31 '12 at 23:36
  • Yea, I did it, thanks @Korpela, I set encoding to UTF-16 for xmlObject before writing it down :) – Songokute Nov 01 '12 at 00:50

You can use a regular expression to replace it after receiving the server response. A simple example in Python:

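import re

# SERVER_RESPONSE holds the raw text returned by the server.
pattern = re.compile(r'&#')  # matches only the "&#" prefix of a character reference
new_content = pattern.sub(' ', SERVER_RESPONSE)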
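
Note that this pattern replaces only the `&#` prefix, so the digits and the semicolon of each reference stay in the text. A hedged variant, assuming the goal is to drop just the invalid surrogate references while keeping every other numeric reference, might look like the sketch below. The name `strip_surrogate_refs` is hypothetical, and the approach discards the emoji rather than preserving them.

```python
import re

# A single decimal character reference such as "&#55357;".
NUMERIC_REF = re.compile(r'&#(\d+);')


def strip_surrogate_refs(text, replacement=' '):
    """Replace references in the surrogate range (U+D800..U+DFFF).

    Hypothetical helper: other numeric references are left untouched.
    """
    def repl(match):
        code = int(match.group(1))
        if 0xD800 <= code <= 0xDFFF:
            return replacement
        return match.group(0)

    return NUMERIC_REF.sub(repl, text)
```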
Boseam