3

I am rewriting a small python script in node.js. The original script worked like this:

# -*- coding: utf-8 -*-
import urllib
import httplib
import json

def rpc(url, args = { }):
  try:
    post_data = json.dumps({'args': args})
    f = urllib.urlopen(url, post_data)
    if not f or f.code != 200:
      return { 'result': 1, 'error': 'urlopen returned error' }
    data = f.read()
    js_data = json.loads(data)
  except Exception, e:
    return { 'result': 2, 'error': e }
  else:
    return { 'result': 0, 'data': js_data }

print rpc('http://server.local/rpc', {'x': u'тест'})

I use request to do the same in node.js:

var request = require('request')

request.post('http://server.local/rpc', {
    json: {'x': 'тест'}
}, function(err, result) {
    console.log(err, result.body)
})

It works, but the unicode data is garbled, so that I get ÑеÑÑ instead of тест when querying the data back. It seems strange, given that both python and node.js should be sending utf8-encoded data.

Btw, the server is written in perl, I think, but that's all I know about it :(

Also, server returns unicode data on other queries, so it is able to do that.

Upd. my console prints unicode characters fine.

Upd. Rewrote my code to use node.js http module:

var http = require('http')

var options = {
  hostname : 'server.local',
  path     : '/rpc',
  method   : 'POST'
}    
var req = http.request(options, function (res) {
  res.setEncoding('utf8');
  res.on('data', function (chunk) {
    console.log('BODY: ' + chunk);
  });
});    
var body = JSON.stringify({'x': 'тест'})    
req.setHeader('Content-length', body.length)    
// python sends data with this header
req.setHeader('Content-type', 'application/x-www-form-urlencoded')

req.on('error', function (e) {
  console.log('problem with request: ' + e);
});    
req.end(body, 'utf8');

The results are sadly the same. Also same behavior on two different installations (my personal MBA and production Debian server). So it does seem to be something with the way node.js represents unicode data.

F0RR
  • Is your console unicode aware? Can you print a hardcoded `тест` in node? – poke Feb 07 '14 at 11:56
  • Yes, console is unicode aware. – F0RR Feb 07 '14 at 12:21
  • I have a feeling this could be the UCS-2 curse. Can you check the length of the body without setting any encoding (the default, which is a Buffer)? Or better, print the entire buffer. – user568109 Feb 10 '14 at 10:13
  • To read more about it see http://mathiasbynens.be/notes/javascript-encoding – user568109 Feb 10 '14 at 10:27
  • Well, yeah, my first thought was that something was wrong with character conversion on my side, but the characters in `тест` are within the BMP and the escape sequences from both python and node.js seem to be the same (\u0442\u0435\u0441\u0442). – F0RR Feb 10 '14 at 11:16
  • Also printing the buffer is problematic, given that there are other parameters in the body and the buffer is large, but new Buffer('{"x":"test"}').length is 12 and new Buffer('{"x":"тест"}').length is 16. – F0RR Feb 10 '14 at 11:18
  • OK, I just encountered this behaviour. It turns out this was a console issue. How are you running this? I had to change settings in the putty console. – user568109 Feb 11 '14 at 05:59
  • The service is basically a simple CRUD server: I add data with one call, then retrieve it with another and show it on a webpage. If I add data with the old python script, the retrieved data is all ok, but if I add it with node, it is `ÑеÑÑ` – F0RR Feb 11 '14 at 07:14
  • I run it on UTF-8 osx 10.9.1 terminal and debian squeeze server (also unicode-enabled) – F0RR Feb 11 '14 at 07:17

4 Answers

2

Here is the request made by a python script:

POST / HTTP/1.0
Content-Type: application/x-www-form-urlencoded
Content-Length: 43
Host: localhost:1234
User-Agent: Python-urllib/1.17

{"args": {"x": "\u0442\u0435\u0441\u0442"}}

Here is the request made by a node.js server:

POST /rpc HTTP/1.1
Host: localhost:1234
Content-length: 12
Content-type: application/x-www-form-urlencoded
Connection: keep-alive

{"x":"тест"}

Do you see the issue? JSON.stringify leaves the Cyrillic characters as-is, so they go over the wire as raw UTF-8 bytes, whereas Python's json.dumps escapes every non-ASCII character to a \uXXXX sequence by default (ensure_ascii=True), producing a pure-ASCII body.

If your rpc server doesn't understand utf8, you can encode json using external libraries. For example, this would work:

var request = require('request');
var jju = require('jju');

request.post({
   uri: 'http://localhost:8080/rpc',
   body: jju.stringify({args: {x: "тест"}}, {
       mode: 'json',
       indent: false,
       ascii: true,
   }),
}, function(err, res, body) {
    console.log(body);
});

With the code above the request would look like this:

POST /rpc HTTP/1.1
host: localhost:8080
content-length: 41
Connection: keep-alive

{"args":{"x":"\u0442\u0435\u0441\u0442"}}

Which is similar to what python is doing.
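For reference, the same \uXXXX escaping that json.dumps applies by default can also be reproduced without any dependency. A minimal sketch (the `asciiJson` helper name is made up here):

```javascript
// Escape every non-ASCII character to \uXXXX, mimicking Python's
// json.dumps default behaviour (ensure_ascii=True).
function asciiJson(value) {
  return JSON.stringify(value).replace(/[\u0080-\uffff]/g, function (ch) {
    return '\\u' + ('000' + ch.charCodeAt(0).toString(16)).slice(-4);
  });
}

console.log(asciiJson({ args: { x: 'тест' } }));
// {"args":{"x":"\u0442\u0435\u0441\u0442"}}
```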

alex
  • Wow... Thank you very much! Also `jju` is ridiculously difficult to find with google, so here's a link for future generations : [jju](https://www.npmjs.org/package/jju) – F0RR Feb 13 '14 at 13:09
  • it's [on github](https://github.com/rlidwka/jju) and installable from npm under the same name, I didn't think it's that difficult – alex Feb 13 '14 at 13:26
  • Yeah, it's just that google can't seem to find it :) Npm did, so not _that_ big a problem, but still) – F0RR Feb 14 '14 at 08:37
0

Well, try eliminating variables.

Use the native http.request instead of the request module; even if it's more complicated, it'll eliminate request as a possible culprit.

When you send your data, make the utf8 encoding explicit.

I don't know enough about the internals of the request module to figure out where the breakdown might be occurring or if there are options you need to pass, but this would at least give you a way to figure out if the default node http.request could get you unstuck, or if there appears to be a deeper issue with your install.

Jason
0

The real problem is that you are telling the server you are going to send less data than you are actually sending, so when the server decodes the data it gets corrupted.

body.length gives you the number of UTF-16 code units in the string body. For US-ASCII characters, 1 code unit = 1 byte, but this does not hold for non-ASCII characters.

With that logic, every non-ASCII character adds extra bytes that you need to account for in the Content-length header.

http://en.wikipedia.org/wiki/UTF-8

Change the Content-length line to:

// count the percent-encoded UTF-8 continuation bytes (%80–%BF)
// that body.length misses; guard against match() returning null
var utf8overhead = (encodeURIComponent(body).match(/%[89ABab]/g) || []).length;
var bodyLength = body.length + utf8overhead;
req.setHeader('Content-length', bodyLength);
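A simpler way to get the byte count, not part of the original answer, is Node's built-in Buffer.byteLength:

```javascript
var body = JSON.stringify({ x: 'тест' });

// String#length counts UTF-16 code units; Buffer.byteLength counts
// the bytes the string occupies in the given encoding.
console.log(body.length);                      // 12
console.log(Buffer.byteLength(body, 'utf8'));  // 16

// so the header can simply be set as:
// req.setHeader('Content-length', Buffer.byteLength(body, 'utf8'));
```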
Mardie
0

I think this might come from your server or your console.

I've just tested both client and server written in node.js, in a console with UTF-8 support (font 'Lucida Console'), and it works. Server code:

var express = require('express');

var app = express()
  .use(express.methodOverride())
  .use(express.bodyParser())
  .post('/rpc', function(req, res) {
    console.log(JSON.stringify(req.body, null, 2));
    res.send(req.body.x);
  });

app.listen(8080);

(using express@3.4.8)

Console output on request:

{
   "x": "тест"
}

Client code:

var request = require('request');

request.post({
  uri:'http://localhost:8080/rpc',
  json:{"x": "тест"}
}, function(err, res, body) { 
   console.log(body); 
});

(using request@2.21.0)

Console output:

тест

The whole thing runs on Node.js 0.10.4.

It works also by using an HTTP client like curl or the "Advanced REST Client" chrome extension.

But the important thing is that the request is sent with the "application/json" content type (not the classic x-www-form-urlencoded), and that the bodyParser() middleware on the server performs the JSON deserialization.

Feugy