20

I'm using the Request module (simplified HTTP request method) to scrape a webpage that contains accented characters such as á é ó ú ê ã.

I've already tried `encoding: 'utf-8'` with no success. I'm still getting ��� characters in the result.

request.get({
    uri: url,
    encoding: 'utf-8'
    // ...

Is there any configuration to fix it?

I don't know if it is a bug, but I filed an issue for this module. No answers yet. :/

dsh
Pablo Cantero
  • Well, what encoding is the web page written in? utf8? iso-something? – thejh Nov 30 '11 at 20:34
  • 4
    I answered you in the issue (https://github.com/mikeal/request/issues/118#issuecomment-2965894). I don't know why, but I used 'binary' for the encoding and it worked. – Pablo Cantero Nov 30 '11 at 21:45
  • 4
    Also for me, just adding `encoding: binary` worked great – Renato Gama Feb 14 '14 at 15:43
  • 1
    @renatoargh, it will work great until that website changes its encoding; then it will suddenly break. Use iconv instead and decode properly based on the Content-Type header, unless this is a one-time job and you don't care. – alex Feb 15 '14 at 00:10
  • @alex I will have a look! It is an important job, thank you – Renato Gama Feb 15 '14 at 10:07
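
To illustrate the suggestion in the comments above (decode according to the Content-Type header), here is a minimal sketch of extracting the charset from a header value. The helper name `charsetFromContentType` and the `utf-8` fallback are assumptions for illustration, not part of request or iconv.

```javascript
// Hypothetical helper: pull the charset out of a Content-Type header value.
// Falls back to 'utf-8' (an assumption) when no charset parameter is present.
function charsetFromContentType(contentType, fallback = 'utf-8') {
    if (typeof contentType !== 'string') return fallback;
    // Matches charset=foo and charset="foo", case-insensitively.
    const match = /charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType);
    return match ? match[1].toLowerCase() : fallback;
}

console.log(charsetFromContentType('text/html; charset=ISO-8859-1'));
console.log(charsetFromContentType('text/html'));
```

With request, the header would typically be read from `response.headers['content-type']`, and the result could then be passed to a decoder such as iconv-lite's `iconv.decode(body, charset)`. Pages that declare their charset only in a `<meta>` tag would need extra handling that this sketch does not cover.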

4 Answers

27

Since binary is deprecated, it seems like a better idea to use iconv and handle the decoding correctly:

var request = require("request"), iconv = require("iconv-lite");
var requestOptions = { encoding: null, method: "GET", uri: "http://something.com" };

request(requestOptions, function(error, response, body) {
    if (error) { return console.error(error); }
    // With encoding: null, body is already a Buffer, so decode it directly.
    var utf8String = iconv.decode(body, "ISO-8859-1");
    console.log(utf8String);
});

The important part is to set the encoding on the HTTP request to null (`encoding: null`), so that the body is returned as a raw Buffer instead of a string that has already been decoded.
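
To see why the raw bytes matter, here is a small sketch using only Node's built-in Buffer (no request or iconv involved). The sample bytes are an assumed ISO-8859-1 encoding of "café"; a lone 0xE9 byte is not valid UTF-8, so decoding it as UTF-8 yields the replacement character.

```javascript
// "café" encoded as ISO-8859-1: the final byte 0xE9 is "é".
const bytes = Buffer.from([0x63, 0x61, 0x66, 0xe9]);

// Decoding as UTF-8 turns the invalid byte into U+FFFD (shown as �).
const asUtf8 = bytes.toString('utf8');

// Decoding as latin1 maps each byte to the code point of the same value.
const asLatin1 = bytes.toString('latin1');

console.log(JSON.stringify(asUtf8));
console.log(asLatin1);
```

This may also explain why `'binary'` appeared to work in the comments: in current Node versions `'binary'` is an alias for `'latin1'`, which happens to match an ISO-8859-1 page.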

row1
  • 1
    This works great, but I have two questions. 1. Why do you need to create a new Buffer for body? I tried using body directly and didn't see any difference. What am I missing? 2. If the web page says charset=utf-8, why do I have to use iconv-lite to decode it as ISO-8859-1? – newman Mar 04 '15 at 16:31
2

Specify the encoding as utf8, not utf-8. Here is a list of possible encodings for a buffer, from the Node.js documentation.

  • ascii - for 7 bit ASCII data only. This encoding method is very fast, and will strip the high bit if set.
  • utf8 - Unicode characters. Many web pages and other document formats use UTF-8.
  • base64 - Base64 string encoding.
  • binary - A way of encoding raw binary data into strings by using only the first 8 bits of each character. This encoding method is deprecated and should be avoided in favor of Buffer objects where possible. This encoding will be removed in future versions of Node.
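
A quick way to check which encoding names Node's Buffer accepts is `Buffer.isEncoding`. The sketch below assumes a reasonably recent Node version; the list of names tested is chosen only for illustration.

```javascript
// Ask Node which of these names are valid Buffer encodings.
const names = ['utf8', 'utf-8', 'latin1', 'binary', 'iso-8859-1', 'shift_jis'];
const supported = {};
for (const name of names) {
    supported[name] = Buffer.isEncoding(name);
}

console.log(supported);
```

Names that Buffer does not accept, such as `iso-8859-1` or `shift_jis`, need a separate decoder (for example iconv or iconv-lite).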
DHamrick
  • 2
    utf-8 works the same as utf8. The page that I'm scraping is iso-8859-1. The only encoding that worked for me was "binary"... too strange... We discussed it here https://github.com/mikeal/request/issues/118 – Pablo Cantero Dec 01 '11 at 22:02
  • 1
    binary works for me. I'm using the request module, and I passed encoding: 'binary' in the options. Thank you – Marcos Mendes May 03 '17 at 21:30
0

I tried this and it worked (Shift_JIS):

var concat  = require('concat-stream'),
    Iconv   = require('iconv').Iconv,
    request = require('request');

var conv = new Iconv('Shift_JIS', 'utf8'),
    req  = request('http://www.alc.co.jp/');

req.pipe(conv);

req.on('error', function() {
    console.log('an error occurred');
});

conv.pipe(concat(function(body) {
    console.log(body.toString());
}));

https://github.com/request/request/issues/1080#issuecomment-56172161
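
Piping through a converter stream matters because a multi-byte character can be split across two network chunks, and decoding each chunk on its own would corrupt it. The sketch below shows the same idea with Node's built-in `TextDecoder` rather than the `iconv` package used above; `decodeChunks` is a hypothetical helper, and the sample uses UTF-8 bytes so that it runs on any Node build. Decoding `shift_jis` with `TextDecoder` assumes a Node build with full ICU.

```javascript
// Hypothetical helper: decode a sequence of Buffer chunks without corrupting
// characters whose bytes are split across chunk boundaries.
function decodeChunks(chunks, charset) {
    const decoder = new TextDecoder(charset);
    let text = '';
    for (const chunk of chunks) {
        // stream: true keeps incomplete trailing bytes for the next call.
        text += decoder.decode(chunk, { stream: true });
    }
    // A final call with no input flushes any remaining state.
    return text + decoder.decode();
}

// "café" in UTF-8, with the two bytes of "é" (0xC3 0xA9) split across chunks.
const chunks = [Buffer.from([0x63, 0x61, 0x66, 0xc3]), Buffer.from([0xa9])];

// Decoding each chunk separately breaks the split character.
const naive = chunks.map((c) => c.toString('utf8')).join('');

console.log(decodeChunks(chunks, 'utf-8'));
console.log(JSON.stringify(naive));
```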

Tính Ngô Quang
0

Not a direct answer to the OP, but I had a similar problem and this might help someone.

In my case the issue was that the response was gzip-compressed, so it needed to be decompressed first:

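var request = require('request'), zlib = require('zlib');
var headers = { 'Accept-Encoding': 'gzip' };
request({ url: url, headers: headers, encoding: null }, (err, res, body) => {
    // body is the raw gzip-compressed Buffer; decompress it before decoding
    zlib.gunzip(body, (err, decoded) => {
        console.log(decoded.toString());
    });
});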
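
Since a server may ignore `Accept-Encoding` and send an uncompressed body, it can help to check before decompressing. Here is a minimal sketch using only the built-in `zlib` module; `maybeGunzip` is a hypothetical helper that checks for the gzip magic bytes (0x1f 0x8b), and the sample data is made up for illustration.

```javascript
const zlib = require('zlib');

// Hypothetical helper: gunzip only when the buffer starts with the gzip
// magic bytes; otherwise return the buffer unchanged.
function maybeGunzip(buffer) {
    const isGzip = buffer.length >= 2 && buffer[0] === 0x1f && buffer[1] === 0x8b;
    return isGzip ? zlib.gunzipSync(buffer) : buffer;
}

// Simulate a compressed and an uncompressed response body.
const compressed = zlib.gzipSync(Buffer.from('olá', 'utf8'));
const plain = Buffer.from('plain', 'utf8');

console.log(maybeGunzip(compressed).toString('utf8'));
console.log(maybeGunzip(plain).toString('utf8'));
```

Checking the response's `content-encoding` header would be an alternative to sniffing the bytes; the synchronous `gunzipSync` is used here only to keep the sketch short.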
Nertan Lucian