I am currently having an issue with PhantomJS (version 2.1.1 on Windows 7) not recognising UTF-8 characters. Before asking this question, I found the following two articles useful for configuring the command prompt:
As suggested by the second article, I used the command
chcp 65001
to change the code page to UTF-8. I also set the command prompt's default font to Lucida Console.
To test that this had worked, I created the following UTF-8 text file:
---------------------------------------------------------
San José
Cañon City
Przecław Lanckoroński
François Gérard Hollande
El Niño
vis-à-vis
---------------------------------------------------------
and then ran the following command to demonstrate whether characters were being recognised and correctly displayed by the command prompt:
type utf8Test.txt
After this worked, I turned my attention to PhantomJS. Following the instructions here, I created the settings JSON file below to ensure that UTF-8 is the input and output character encoding (though this appears to be the default according to the official documentation).
{
    "outputEncoding": "utf8",
    "scriptEncoding": "utf8"
}
I then ran the following JavaScript through PhantomJS, using the aforementioned JSON settings file, in the same command prompt window:
console.log("---------------------------------------------------------");
console.log("San José");
console.log("Cañon City");
console.log("Przecław Lanckoroński");
console.log("François Gérard Hollande");
console.log("El Niño");
console.log("vis-à-vis");
console.log("---------------------------------------------------------");
var page = require('webpage').create();
// Display the initial requested URL
page.onResourceRequested = function(requestData, request) {
    if (requestData.id === 1) {
        console.log(requestData.url);
    }
};
// Display any initial requested URL response error
page.onResourceError = function(resourceError) {
    if (resourceError.id === 1) {
        console.log(resourceError.status + " : " + resourceError.statusText);
    }
};
page.open("https://en.wikipedia.org/wiki/San_José", function(status) {
    console.log("---------------------------------------------------------");
    phantom.exit();
});
The output from running this script is shown below:
From this I can see that PhantomJS is not handling the UTF-8 special characters in my script correctly. Furthermore, when a URL contains a special or accented character, it sends the Unicode replacement character (U+FFFD, the "unknown" character) to the website in its place, as below:
URL passed to PhantomJS:
https://en.wikipedia.org/wiki/San_José
URL passed to remote host:
https://en.wikipedia.org/wiki/San_Jos%EF%BF%BD
----------------------------------------------
%EF%BF%BD
�
instead of:
%C3%A9
é
This causes websites to respond with '400 : Bad Request' errors, and in the case of Wikipedia specifically, requesting the URL https://en.wikipedia.org/wiki/San_Jos%EF%BF%BD results in an error message of:
Bad title - The requested page title contains an invalid UTF-8 sequence.
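To make the difference between the two encodings concrete, here is a minimal sketch in plain JavaScript (nothing PhantomJS-specific) of where each percent-encoded sequence comes from. The variable names are only illustrative, and the characters are written as \u escapes so that the encoding of the script file itself cannot influence the result:

```javascript
// Percent-encode the character I intended to send (é, U+00E9).
var expected = encodeURIComponent("\u00E9");

// Percent-encode the Unicode replacement character (U+FFFD), which is
// what a decoder substitutes for a byte sequence it cannot interpret.
var replacement = encodeURIComponent("\uFFFD");

// The URL I would expect to reach the remote host.
var asciiOnlyUrl = "https://en.wikipedia.org/wiki/" +
    encodeURIComponent("San_Jos\u00E9");

console.log(expected);
console.log(replacement);
console.log(asciiOnlyUrl);
```

If this is right, the é has already been replaced by U+FFFD before the URL is percent-encoded, which would suggest the character is lost when the script or its string is decoded rather than at the network layer. That is only my reading of the output, not something I have confirmed.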
So, with all this being said, does anyone know how to remedy this? Many websites these days use UTF-8 special/accented characters in their page URLs, and it would be great if PhantomJS could be used to access them.
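In case it helps anyone reproduce or narrow this down, below is a small diagnostic sketch (the helper name codeUnits is my own, not part of any API). It prints the UTF-16 code units of a string literal typed with the accented character next to the same string written with a \u escape. I am assuming that if the script file is decoded correctly the two lines will match, and that a mismatch would point at how the script is being read:

```javascript
// Return the UTF-16 code units of a string as space-separated hex values.
function codeUnits(text) {
    var units = [];
    for (var i = 0; i < text.length; i++) {
        units.push(text.charCodeAt(i).toString(16));
    }
    return units.join(" ");
}

// Typed directly: the result depends on how the script file is decoded.
var literal = "José";
// ASCII-only source: independent of the script file's encoding.
var escaped = "Jos\u00E9";

console.log("literal: " + codeUnits(literal));
console.log("escaped: " + codeUnits(escaped));
```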
I really appreciate any help or suggestions you can provide me with.