
I am trying to access an online API that returns XML from a Perl script, and the queries use the Catalan alphabet: à, é, è, í, ò, ó, ú, ·, ç.

I am using Perl's URI::Escape. An MWE of what I am trying to do (without the actual URL of the dictionary I am trying to access, in case it is considered spam) would be:

use LWP::Simple;
use URI::Escape;
use utf8;

my $word = <STDIN>;
$word = uri_escape_utf8($word);
my $xmlweb = get("http://www.urlofthedictionary.com/search?q=$word&format=text/xml");

It "works", i.e. no error shows up, but it does not work properly: no results are returned if the word contains any of these special characters. For example, if I type país, uri_escape_utf8() returns pa%C2%A1s%0A. If I paste that exact string into the URL in my browser, it searches for pais (instead of país) and gives no results, and the URL itself gets "translated" to pais. If I use plain uri_escape() instead, the website gives an error: Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '='

This is driving me insane; I always have problems with encodings. Does anybody know what I am doing wrong? If the dictionary's URL is needed I will provide it, no problem with that.

  • Seems like a problem of the site itself... encoding seems proper. What exact site is that? – Flash Thunder May 21 '14 at 19:06
  • @FlashThunder [link](http://openthesaurus.softcatala.org/synonyme/search?q=prova&format=text/xml) for an example (prova would be the word searched here). I have just discovered that the site's escaping for í is %C3%AD instead of %C3%A1, is that another different encoding? utf-16 maybe? As you see there is no version tag. –  May 21 '14 at 19:11
  • C3.A1 is the UTF-8 of U+00E1, LATIN SMALL LETTER A WITH ACUTE – ikegami May 21 '14 at 19:28

2 Answers


Problem 1. You forgot to remove (chomp) the trailing newline (%0A).


Problem 2. uri_escape_utf8 expects Unicode code points, but I don't think you provided that. You need to decode what you got from STDIN. You can use:

use open ':std', ':encoding(cp850)';

850 was obtained from the output of chcp. It could be different for you.


$ perl -MURI::Escape=uri_escape_utf8 -E'
   say uri_escape_utf8 "pa\N{LATIN SMALL LETTER I WITH ACUTE}n";
'
pa%C3%ADn
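
The two problems above can be reproduced without a console. The following is a minimal sketch, assuming code page 850 (where í is the single byte 0xA1); the variable names are illustrative only. It escapes the same input once without decoding and once after decoding with Encode's decode():

```perl
use strict;
use warnings;
use Encode qw(decode);
use URI::Escape qw(uri_escape_utf8);

# Assumption: the console uses code page 850, where í is the single byte 0xA1.
# This string stands in for the raw line that <STDIN> would return.
my $raw = "pa\xA1s\n";

# Problem 1: drop the trailing newline so it is not escaped as %0A.
chomp(my $undecoded = $raw);

# Problem 2: without decoding, byte 0xA1 is treated as code point U+00A1.
my $wrong = uri_escape_utf8($undecoded);

# Decoding first maps byte 0xA1 to code point U+00ED (í).
my $decoded = decode('cp850', $undecoded);
my $right   = uri_escape_utf8($decoded);

print "$wrong\n";   # pa%C2%A1s
print "$right\n";   # pa%C3%ADs
```

The first result matches the pa%C2%A1s seen in the question (minus the %0A), which suggests that the input was escaped as raw bytes rather than as decoded characters.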
ikegami
  • this seems to be an answer – Flash Thunder May 21 '14 at 19:16
  • I see, I did not remove it because it worked with words without accents (for example prova%0A works). I have done it now but it still does not work, it appears that something may go wrong in the middle because it escapes í to %C3%A1 instead of %C3%AD . Maybe it has something to do with Strawberry Perl's encoding? –  May 21 '14 at 19:20
  • @ikegami Thank you very much, this solved the encoding of the terminal but now I have problems with the encoding of the xml. I will read [this thread](http://stackoverflow.com/questions/15224400/perl-on-windows-problems-with-encoding) that is related. Just one more question, is there any way to make this system-independent? –  May 21 '14 at 19:31
  • No. I'd write up a module, but no one's been able to tell me how to determine the encoding to use on unix machines. – ikegami May 21 '14 at 19:34

If I call binmode(STDIN, ':utf8') before reading from STDIN and also make sure that my terminal sends UTF-8, then I get the correct encoding, %C3%AD.
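
As a rough sketch of that approach, assuming a terminal that sends UTF-8: the in-memory filehandle below only stands in for STDIN so the example runs without typing anything, and the URL is the placeholder from the question. In a real script the equivalent would be a binmode call on STDIN with a UTF-8 layer, followed by reading from STDIN.

```perl
use strict;
use warnings;
use URI::Escape qw(uri_escape_utf8);

# Assumption: the terminal sends UTF-8, so í arrives as the bytes 0xC3 0xAD.
my $bytes = "pa\xC3\xADs\n";

# In-memory handle standing in for STDIN, opened with a UTF-8 decoding layer.
open(my $in, '<:encoding(UTF-8)', \$bytes) or die "open failed: $!";

my $word = <$in>;
chomp $word;                          # remove the trailing newline

# $word now holds decoded characters, which is what uri_escape_utf8 expects.
my $escaped = uri_escape_utf8($word);

# Placeholder URL taken from the question, not a real endpoint.
my $url = "http://www.urlofthedictionary.com/search?q=$escaped&format=text/xml";
print "$url\n";
```

The strict ':encoding(UTF-8)' layer is used here instead of ':utf8' because it validates the input; that is a design choice of this sketch, not something the answer prescribes.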

Steffen Ullrich