How can I read special characters from an external file? Here is a simple .txt file in French, whose content is the first paragraph of https://fr.lipsum.com/. As you can see in my screenshot, the file encoding is UTF-8, but the accents are not displayed correctly.

I tried various encodings in Notepad++ and in my Perl 6 script, such as:

enc => "utf8"
enc => "latin1"

With Python or Ruby scripts I don't encounter the problem. I can't find any precise example on this matter, probably because Perl 6 is still quite recent (?). Thank you.

My script as it is displayed in the screenshot :

my $text_contents = slurp "testfile.txt", enc => "utf8";
say $text_contents;
prompt;

Perl6 script, input file in notepad++, exec in cmd.exe


Final edit: the solution is to enable an option, available in beta since Windows 10 1803, that makes the OS handle Unicode characters properly. See the answer and comments below.

Frenzowski
  • Please provide the encoding used for the `.txt` file (as shown in the screen shot image), also provide a snippet of the `.txt` file as *text* (not as image) in your question. You should also post the perl 6 script as text, this will help us with copy and paste trying to reproduce your behavior. Thanks! – Håkon Hægland Mar 15 '19 at 10:24
  • By default slurp reads as UTF-8 (and from the screenshot it looks like that's the encoding of your file). What happens if you create a UTF-8 character directly in perl6 and then output it? E.g.: `perl6 -e 'say "\c[Latin Small Letter A with Acute]"'` If that outputs á then you're OK. If not, then the problem isn't reading the file; your command line can't handle UTF-8 output. I've not got a Windows machine at hand to test it on, though. – Scimon Proctor Mar 15 '19 at 10:56
  • Another check to run: `type testfile.txt`. Does that output the UTF-8 characters correctly? – Scimon Proctor Mar 15 '19 at 11:02
  • @Scimon : In cmd.exe, the first command you mention does not work : `perl6 -e 'say "\c[Latin Small Letter A with Acute]"'` ===SORRY!=== Error while compiling -e Unable to parse expression in single quotes; couldn't find final "'" (corresponding starter was at line 1) at -e:1 ------> 'say expecting any of: single quotes term – Frenzowski Mar 15 '19 at 11:45
  • `type testfile.txt` gives me the same output as Rakudo – Frenzowski Mar 15 '19 at 11:46
  • So there's the problem. Your console isn't able to display UTF-8 correctly. https://stackoverflow.com/questions/388490/how-to-use-unicode-characters-in-windows-command-line This answer may help. – Scimon Proctor Mar 15 '19 at 12:05
  • The issue with the command here is that Windows cmd treats single quotes as a regular character. Please try `perl6 -e "say qq/\c[Latin Small Letter A with Acute]/"`. Do you have Windows PowerShell on your machine? I recommend trying the same command on that to see if you encounter the same issue. (Apologies for the lack of line breaks, SO's mobile site doesn't seem to let me enter them properly) – Daniel Mita Mar 15 '19 at 12:10
  • @DanielMita In both CMD and PowerShell: `>perl6 -e "say qq/\c[Latin Small Letter A with Acute]/"` `>├í` PS: I'm new to posting on SO, and I did not manage to include line breaks either (even with two spaces at the end of the line) – Frenzowski Mar 15 '19 at 12:38
  • @Scimon Thank you for the link, unfortunately it's very long so I would like to find a tl;dr for this – Frenzowski Mar 15 '19 at 12:44
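
The `├í` reported in the comments above is what you would expect when the two UTF-8 bytes of `á` are each drawn as a separate character from a legacy OEM code page. Here is a minimal Python sketch of that effect. It assumes the console was on code page 437 or 850, which the thread never states, so treat it as an illustration of the mechanism rather than a diagnosis of this particular machine.

```python
# Illustration only: simulate a console that interprets UTF-8 bytes
# using a legacy OEM code page instead of UTF-8.
# Assumption: the console used cp437 or cp850 (not stated in the thread).
text = "á"  # LATIN SMALL LETTER A WITH ACUTE

utf8_bytes = text.encode("utf-8")  # two bytes: 0xC3 0xA1
print(utf8_bytes)

for code_page in ("cp437", "cp850"):
    # Each byte is mapped to its own glyph in the OEM code page.
    shown = utf8_bytes.decode(code_page)
    print(code_page, shown)

# Decoding the same bytes as UTF-8 recovers the original character.
round_trip = utf8_bytes.decode("utf-8")
print(round_trip)
```

The program reading the file is doing the right thing in this scenario. The bytes are valid UTF-8, and only the display layer misreads them, which is why `type testfile.txt` shows the same garbling as Rakudo.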

1 Answer

If you're not using Windows

This SO question is either entirely or almost entirely irrelevant to you.

If you're using Windows 10

Check the "Beta: Use Unicode UTF-8 for worldwide language support" option checkbox.

At least at the time I originally wrote this answer, text near this Unicode related checkbox claimed it's for programs that do not support Unicode, but you should just ignore that.[1]

At the time I originally wrote this answer, the checkbox was found in Control Panel under the "Region" entry, "Administrative" tab, "Change system locale" button.

Microsoft may have changed this stuff since I wrote this answer, and may change it again, eg by moving and/or renaming the checkbox, or making things more involved than just clicking a single checkbox.

Per their comment below this answer, the OP notes:

For those who are interested in that particular option, it can be found in the "legacy" Control panel of windows -> Region -> Administrative -> Edit settings...

If you're using an older version of Windows

Arguably, the good news is that Raku and Rakudo have some of the world's best modern support for Unicode, and the OK news is that it relies on Microsoft correctly supporting Unicode, which they're now trying to do.

The bad news is that they made a lot of mistakes in older versions of Windows (and even in Windows 10, which they're now trying to fix), so any solution will be constrained by those mistakes. (Perhaps the biggest problem is Microsoft's doublespeak on the topic[1], but let's hope we can work around that.)

That all said, please read the following and then either return to searching for solutions or post a fresh SO question and we'll try to help.


Quoting Wikipedia's page Unicode in Microsoft Windows:

they are still in 2018 improving their operating system support for UTF-8

Microsoft got off on the wrong foot with their Unicode support last century. The good news is that they have at last begun digging their way out of the hole they dug for themselves and everyone else.

But they're definitely not there yet -- not at the time of originally writing this answer, and, I suspect, not for another N years -- at least inasmuch as things don't work correctly out of the box for many end users. I think this is the root of most problems with Unicode on Windows.

Older languages like Python, Ruby and Perl came up with a range of hacks that hid the many problems with Microsoft's older UTF-8 support from most users in simple scenarios by using what Microsoft ironically described as "Unicode support".

This has always come with the trade-off that things get very hairy or even completely unworkable for more complex applications in many locales around the world. (So much so that even the mighty Microsoft finally capitulated in 2018.)

In essence, until this new Microsoft effort to get with the program, software that ran on Windows has had no alternative but to either use their fundamentally broken "Unicode support" or to actually support Unicode properly.[1]

Raku and Rakudo focused on the latter, and problems with it when run on Windows are related to this conflicting with Microsoft's old broken approach. Fortunately Microsoft is now getting with the program, so we may be able to find a way around the problems you have with Unicode on Windows, provided you are patient.

In particular, if you are using an older Windows version, please expect it to not work at first with modern Unicode aware software unless you are lucky. We'll still help if we can, but it'll likely involve you being patient with us and Microsoft and Rakudo and vice-versa.

Footnotes

[1] At the time I originally wrote this answer, there was text near the checkbox saying that it's for programs that do not support Unicode. This is entirely the opposite of what's really going on, but hey, it's Microsoft.

raiph
  • I'm just trying Perl6 out of curiosity and I admit I am a bit lazy, so I think I will wait until improvements are made to reconcile Perl6 and special characters. Thank you very much for your detailed answer! – Frenzowski Mar 17 '19 at 14:45
  • Hi @Frenzowski P6 is grounded in use of Unicode so it works well with Unicode. I'm not sure it'll ever tackle issues related to non-Unicode characters and I doubt Microsoft will ever make a change other than the one they're trying in Windows 10. Are you using Windows 10? If so, I'm hoping you're not so lazy you didn't try clicking the option Microsoft provided and am curious what happened. If not, it would still be helpful to hear what version of Windows you're using. Thanks for any response. – raiph Mar 17 '19 at 15:44
  • Hi @raiph, you are right, I'm not that lazy. Yes, I use Windows 10 and actually I thought it was about downloading a beta version of the OS. Silly me. I just checked the option, restarted my computer, and got the characters displayed well in UTF-8, with the `type` command. Alleluia! For those who are interested in that particular option, it can be found in the "legacy" Control panel of windows -> Region -> Administrative -> Edit settings... Thank you for bringing my attention back to the topic – Frenzowski Mar 18 '19 at 20:01