
I am currently having an issue with PhantomJS (version 2.1.1 on Windows 7) not recognising UTF-8 characters. Before asking this question, I found the following two articles useful for configuring the command prompt:

As suggested by the second article, I used the command

chcp 65001

to change the code page to UTF-8. I then also set the command prompt's default font to Lucida Console.

To test this had worked, I created the following UTF-8 text file

---------------------------------------------------------
San José
Cañon City
Przecław Lanckoroński
François Gérard Hollande
El Niño
vis-à-vis
---------------------------------------------------------

and then ran the following command to demonstrate whether characters were being recognised and correctly displayed by the command prompt:

type utf8Test.txt

UTF-8 accented characters recognised and correctly displayed by the command prompt

After this worked, I turned my attention to PhantomJS. Following the instructions here, I created the JSON settings file below to ensure that UTF-8 is the input and output character encoding (though this appears to be the default according to the official documentation).

{
    "outputEncoding": "utf8",
    "scriptEncoding": "utf8"
}

I then ran the following JavaScript through PhantomJS using the aforementioned json settings file in the same command prompt window:

console.log("---------------------------------------------------------");

console.log("San José");
console.log("Cañon City");
console.log("Przecław Lanckoroński");
console.log("François Gérard Hollande");
console.log("El Niño");
console.log("vis-à-vis");

console.log("---------------------------------------------------------");

var page = require('webpage').create();

// Display the initial requested URL
page.onResourceRequested = function(requestData, request) { 
    if(requestData.id === 1){
        console.log(requestData.url);
    }
};

// Display any initial requested URL response error
page.onResourceError = function(resourceError) {
    if(resourceError.id === 1){
        console.log(resourceError.status + " : " + resourceError.statusText);
    }
};

page.open("https://en.wikipedia.org/wiki/San_José", function(status) {
    console.log("---------------------------------------------------------");
    phantom.exit();
});

The output from running this script is shown below:

UTF-8 accented characters not displayed by PhantomJS

From this I can see that PhantomJS is not able to understand UTF-8 special characters, and furthermore it passes the "unknown" character to websites when provided with a special or accented character as below:

URL passed to PhantomJS:   
https://en.wikipedia.org/wiki/San_José

URL passed to remote host: 
https://en.wikipedia.org/wiki/San_Jos%EF%BF%BD

----------------------------------------------

%EF%BF%BD
�

instead of:

%C3%A9
é

This causes websites to respond with '400 : Bad Request' errors, and in the case of Wikipedia specifically, requesting the URL https://en.wikipedia.org/wiki/San_Jos%EF%BF%BD results in an error message of:

Bad title - The requested page title contains an invalid UTF-8 sequence.
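The two percent-encoded sequences above can be reproduced in plain JavaScript, which may help narrow down where the damage happens. This is only an illustrative sketch: `percentEncode` is a hypothetical helper name, not part of PhantomJS, and the `\u` escapes keep the snippet ASCII-only so that it does not depend on how the script file itself is decoded. It shows that U+FFFD (the Unicode replacement character) encodes to %EF%BF%BD, which suggests the 'é' had already been replaced before the URL was percent-encoded.

```javascript
// Hypothetical helper: percent-encode a single character as UTF-8.
function percentEncode(ch) {
    return encodeURIComponent(ch);
}

var accented = percentEncode("\u00E9");     // U+00E9, "é"
var replacement = percentEncode("\uFFFD");  // U+FFFD REPLACEMENT CHARACTER

console.log(accented);     // %C3%A9
console.log(replacement);  // %EF%BF%BD
```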

So, with all this being said, does anyone know how to remedy this? There are many websites these days that use UTF-8 special/accented characters in their page urls, and it would be great if PhantomJS could be used to access them.

I really appreciate any help or suggestions you can provide me with.

  • Isn't the correct URL for the wiki page actually https://en.wikipedia.org/wiki/San_Jos%C3%A9? It is opened by PhantomJS without issues. – Vaviloff Feb 05 '17 at 16:51
  • Thanks for getting back to me. You are correct, that is the URL that I am attempting to access in the example. However, unlike Chrome, and as shown at the end of the second screenshot, PhantomJS does not translate 'é' to %C3%A9, because it does not recognise the character in the script. Instead it translates it to '�' or %EF%BF%BD (character unknown). The main issue for me is that reading a set of URLs from a text file or directly from within a script is not possible if UTF-8 special/accented characters are present. More generally, I'd like to know how to make PhantomJS read such characters. – Derek Σωκράτης Finch Feb 05 '17 at 17:40

1 Answer

var url = 'https://en.wikipedia.org/wiki/San_José';

page.open(encodeURI(url), function(status) {
    console.log("---------------------------------------------------------");
    console.log(page.evaluate(function(){ return document.title }));
    phantom.exit();
});


Yes, it's garbling those symbols on Windows (on Linux it works beautifully) but at least you will be able to open pages and process them.
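For the case raised in the comments, where URLs are read from a text file and some of them may already be percent-encoded, calling encodeURI() on every entry would double-encode the existing escapes, because '%' itself becomes '%25'. One possible way around that, as a minimal sketch assuming only the non-ASCII characters need escaping, is shown below. `encodeNonAscii` is a hypothetical helper, not a PhantomJS API, and it has not been tested against PhantomJS on Windows.

```javascript
// Hypothetical helper: escape only runs of non-ASCII characters,
// leaving existing %XX escapes and URL punctuation untouched.
function encodeNonAscii(url) {
    return url.replace(/[^\x00-\x7F]+/g, function (run) {
        return encodeURIComponent(run);
    });
}

var raw = "https://en.wikipedia.org/wiki/San_Jos\u00E9";      // ends in "é"
var already = "https://en.wikipedia.org/wiki/San_Jos%C3%A9";  // already escaped

console.log(encodeNonAscii(raw));      // https://en.wikipedia.org/wiki/San_Jos%C3%A9
console.log(encodeNonAscii(already));  // https://en.wikipedia.org/wiki/San_Jos%C3%A9
console.log(encodeURI(already));       // https://en.wikipedia.org/wiki/San_Jos%25C3%25A9
```

The result of `encodeNonAscii()` could then be passed to `page.open()` in place of `encodeURI(url)`.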

Vaviloff
  • Thanks, I had forgotten about encodeURI(), that works really well :-) So that answers one half of my question for sure, but I am still wondering how to make PhantomJS display such characters in Windows. From the testing described in the initial part of my question above, it seems to me that the Windows command prompt is now capable of displaying them. Therefore I am wondering if there is anything I am missing regarding PhantomJS encoding configuration to make it recognise and correctly display such characters. – Derek Σωκράτης Finch Feb 05 '17 at 18:24
  • Also, I just realised looking at your screenshot that 'Przecław Lanckoroński' has been displayed without any unknown characters, although the characters in question for that example (namely 'ł' and 'ń') appear to have been converted to 'l' and 'n' respectively. Did you alter that specific text at all in your test? Just curious, as the other 5 appear the same as they do on my setup. And yes indeed, on Linux it does appear to work beautifully! :-) – Derek Σωκράτης Finch Feb 05 '17 at 18:30
  • Regarding Przecław Lanckoroński: for some reason the name got copied without the Polish symbols. As for the rest, it really does seem PhantomJS has trouble working with UTF-8 in the console, seeing as there are always two blank boxes instead of a symbol. – Vaviloff Feb 05 '17 at 18:46
  • Ah ok, that's great, thanks for letting me know. Yeah, it's just a pain, but if I don't hear from anyone else regarding how to solve this, I will mark your encodeURI() answer as the full answer, as it at least enables working with such characters in URLs. – Derek Σωκράτης Finch Feb 05 '17 at 19:44