I am trying to load XML content from a remote host using Node.js.

The problem is that German umlauts like "ä" are broken. In the browser this is usually a simple encoding problem. But since the XML content on the remote host is encoded in "iso-8859-2", I have had no success getting the letters to display correctly.

The functionality is very simple: I use the default HTTP client integrated in Node.js to connect to the remote host with a plain GET request.

Some environment facts:

  • The remote system uses "iso-8859-2" encoding.
  • The encoding is currently set in the response header.
  • The characters are already unrecoverably broken in the data (chunk) received by response.onData(chunk).

Node.js version 0.2 is running on a default Debian server.

The code is based on the default httpClient as described in the Node.js documentation.

I tried the following:

response.defaultAsciiEncoding true/false
response.encoding = UTF-8/ascii

I used a UTF-8 encoder/decoder to encode/decode the chunk. After this failed I tried to encode/decode the whole response body.

I am not very familiar with buffers, and I guess the problem lies in that direction. My second guess is that Node.js (or the httpClient) simply can't handle other encodings by default. In that case I think I would need to write my own HTTP client using the net lib. I just want to make sure I am not heading in the wrong direction :)

agebrock

3 Answers

0

Try setting the encoding parameter in the XML declaration:

<?xml version="1.0" encoding="iso-8859-2" ?>
<xml>
  <!-- whatever -->
</xml>

XML files default to UTF-8 unless you explicitly declare their encoding.
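When the source is not under your control, the declared encoding can be read from the first bytes of the raw response before any text decoding happens. The following is a minimal sketch of that idea, assuming a current Node.js with the global `Buffer`; `sniffXmlEncoding` is a hypothetical helper name, and the sketch ignores byte order marks, UTF-16 input, and any charset given in the HTTP headers.

```javascript
// Hypothetical helper: read the encoding named in an XML declaration.
// Falls back to 'utf-8' when no encoding is declared.
function sniffXmlEncoding(raw) {
  // The declaration consists of ASCII characters in ASCII-compatible
  // encodings, so reading the first bytes as latin1 cannot corrupt it.
  const head = raw.subarray(0, 200).toString('latin1');
  const match = /^<\?xml\s[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']/.exec(head);
  return match ? match[1].toLowerCase() : 'utf-8';
}

// Simulated raw response bodies (no network access needed).
const declared = Buffer.from('<?xml version="1.0" encoding="iso-8859-2" ?><xml/>');
const undeclared = Buffer.from('<?xml version="1.0"?><xml/>');

const declaredEncoding = sniffXmlEncoding(declared);
const undeclaredEncoding = sniffXmlEncoding(undeclared);
console.log(declaredEncoding, undeclaredEncoding);
```

The returned name can then be handed to whatever converter is used for the actual decoding.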

Tomalak
  • The remote source is dynamic and not under my control. But yes, the XML version and encoding are set. I uploaded a sample response to my server. I may add a Node.js script as well to reproduce the error. The sample location is http://node.geht-ab.net/original.html – agebrock Sep 11 '10 at 19:29
  • @age: Not sure what this is supposed to be? It is served as text/html with no encoding parameter. – Tomalak Sep 11 '10 at 20:43
  • Yes, sorry, I forgot to simulate the header correctly: http://node.geht-ab.net/original.php. I just added the Content-Type header. Only the doctype of the XML is set to iso-8859-1; the response itself carries no encoding information. Here are the original headers: Connection: keep-alive, Content-Length: 181706, Content-Type: text/xml, Date: Sun, 12 Sep 2010 02:43:40 GMT, Server: Apache – agebrock Sep 12 '10 at 08:01
0

It seems to me that Node.js can't work with encodings other than UTF-8. Maybe using something like node-iconv would work.

svick
  • The problem here is that I can find no point or event at which to access the raw data. The data already looks touched to me at response.onData(chunk). I may check the Node.js libs to see what's going on. But in case I use net.socket on port 80, the bindings you found could be useful. – agebrock Sep 12 '10 at 09:22
0

I had a quick poke around the Node.js source and it seems like svick is right: Node.js doesn't support the ISO encoding. You can, however, get at the response as a binary stream and then either return it to the browser with your own encoding or use node-iconv (again as svick suggested).

Here's a little example: http://gist.github.com/576884
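For readers on a current Node.js, the same idea can be sketched with `Buffer`: keep every chunk as raw bytes and decode once at the end. This is not the code from the gist and not the 0.2 API; `decodeLatin1Body` is a hypothetical helper. It decodes as latin1 (ISO-8859-1), which is enough for German umlauts because ä, ö, ü, and ß have the same byte values in ISO-8859-1 and ISO-8859-2. Other ISO-8859-2 letters would still need a real converter such as node-iconv.

```javascript
// Hypothetical helper: join raw response chunks and decode them in one step.
function decodeLatin1Body(chunks) {
  // Buffer.concat joins the chunks without interpreting them as text.
  const raw = Buffer.concat(chunks);
  // 'latin1' maps each byte 0x00-0xFF to the code point of the same value.
  return raw.toString('latin1');
}

// Simulated chunks for the word "Mädchen", with "ä" sent as the single
// byte 0xE4 and the body split in two, as a network response might be.
const chunks = [
  Buffer.from([0x4d, 0xe4, 0x64]),
  Buffer.from([0x63, 0x68, 0x65, 0x6e]),
];

const text = decodeLatin1Body(chunks);
console.log(text);
```

In a real request the chunks would come from the response's data events, provided no text encoding has been set on the stream so that the chunks arrive as Buffers.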

bxjx
  • response.setEncoding("binary"); did the trick. I can't believe I didn't try that; somehow I only tried ascii here. For a quick prototype I used the php.js utf8_encode function, and it works perfectly. Thanks for the answers and the link to the iconv bindings. – agebrock Sep 13 '10 at 21:40