4

I have a text file test.txt with the UTF-8 encoded content äöü (these are German umlauts and just an example; the file size is 6 bytes). I also have a Cygwin terminal on a Windows 10 PC with the correct LANG settings:

$ cat test.txt
äöü

I'd like to print the content of this file with a Perl script, but can't get it to work.

open my $fh, '<', 'test.txt';
print <$fh>;
close $fh;

results in

$ perl test.pl
├ñ├Â├╝

I tried all the variations I found at "How can I output UTF-8 from Perl?", but none of them solved my problem. What's wrong?

EDIT per request:

$ file test.txt
test.txt: UTF-8 Unicode text, with no line terminators

$ echo $LANG

I also tried setting LANG to de_DE.UTF-8.

EDIT to narrow down the problem: If I try this with the Perl version 5.32.1 included in Cygwin, it works as expected. It still doesn't work in Strawberry Perl version 5.32.1. So it's probably neither a general Perl problem nor a Windows problem, nor an issue with the language or encoding settings; it's a Strawberry Perl problem.

André
  • It works for me, so there is an encoding issue on your side. What is the output of `file test.txt` and of `echo $LANG`? If you identify the encodings, you can use `iconv` for conversion. – matzeri Feb 19 '21 at 14:09
  • Thanks for trying it. I edited my question and added the results from your commands. – André Feb 19 '21 at 14:16
  • What do the following 3 commands give you: od -t x1 test.txt; cat test.txt | od -t x1; perl test.pl | od -t x1 – Dave Mitchell Feb 19 '21 at 15:25
  • Thanks for this idea. It gives three times "0000000 c3 a4 c3 b6 c3 bc". The output of the script seems to be OK, but `cat test.txt` displays correctly and `perl test.pl` doesn't. – André Feb 19 '21 at 18:00
  • It works fine for me from the Cygwin terminal window if I add `binmode(STDOUT,':encoding(utf-8)')` as suggested in the answer by @AnFi (and also add the `encoding(utf-8)` layer on the input file handle). – Håkon Hægland Feb 19 '21 at 18:10
  • @HåkonHægland: This still gives me `├ñ├Â├╝`. Which perl version and which perl environment did you use? – André Feb 19 '21 at 18:17
  • @André perl version 5.30.3 installed from the Cygwin `perl` package and installed into `/usr/bin/perl`. Btw `od -c test.txt` shows the same as in the answer by @matzeri – Håkon Hægland Feb 19 '21 at 18:22
  • @HåkonHægland: I tried this and it worked. I also tried the Cygwin Perl version 5.32.1 and this works too. But with the Strawberry Perl version 5.32.1 it doesn't work. So it's neither a Perl problem nor a Windows problem; it's a problem of the Strawberry environment. Thanks a lot! – André Feb 19 '21 at 18:29
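
The `od` output quoted in the comments shows that the script emits the correct six UTF-8 bytes, so the garbling most likely happens when those bytes are displayed. A minimal sketch, assuming the console interprets output as code page 850 (the code page named in one of the answers), decodes the same bytes both ways to show where `├ñ├Â├╝` comes from. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The six bytes that od reported for test.txt and for the script output.
my $bytes = "\xc3\xa4\xc3\xb6\xc3\xbc";

# Read as UTF-8, every two-byte pair forms one umlaut.
my $as_utf8 = decode('UTF-8', $bytes);

# Read as code page 850 (assumed console code page), each byte
# becomes a separate character.
my $as_cp850 = decode('cp850', $bytes);

binmode(STDOUT, ':encoding(UTF-8)');
print "UTF-8 reading: $as_utf8\n";
print "CP850 reading: $as_cp850\n";
```

If the second line matches the garbled output of `perl test.pl`, the data is fine and only the console code page differs from the encoding of the output.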

4 Answers

2

If you are in a cmd.exe window or in PowerShell, you can change the codepage to 65001:

chcp 65001

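If you do not want to change the code page, find out what chcp (or "cp" . Win32::GetConsoleOutputCP()) returns and encode the output to that encoding.

use strict;
use warnings;
use Encode;

# Decode the file as UTF-8 while reading.
open my $fh, '<:encoding(UTF-8)', 'test.txt' or die "Cannot open test.txt: $!";
while (<$fh>) {
    print encode('cp850', $_); # re-encode for the console; needs a recent Encode to support cp850
}
close $fh;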
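
Instead of hard-coding cp850, the code page can be looked up at run time, as the paragraph above suggests. The following is only a sketch under some assumptions: `console_encoding_name` is a hypothetical helper, the fallback to UTF-8 is my own choice, and the Win32 module is loaded only on a native Windows perl such as Strawberry Perl.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper: name of the encoding the current console expects.
sub console_encoding_name {
    # The Win32 module is only available on native Windows perls.
    if ($^O eq 'MSWin32' && eval { require Win32; 1 }) {
        my $cp = Win32::GetConsoleOutputCP();
        # A false value or a code page unknown to Encode is treated as "unknown".
        return "cp$cp" if $cp && find_encoding("cp$cp");
    }
    return 'UTF-8';    # assumed fallback for everything else
}

my $name = console_encoding_name();

# Let the output layer do the encoding instead of calling encode() per line.
binmode(STDOUT, ":encoding($name)");

if (open my $fh, '<:encoding(UTF-8)', 'test.txt') {
    print while <$fh>;
    close $fh;
}
else {
    warn "Cannot open test.txt: $!\n";
}
```

This only adapts the output to whatever code page is active. Characters that do not exist in that code page still cannot be displayed as-is, so switching the console to 65001 remains the more general fix.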

If you are in Cygwin bash, you can call chcp with system() like so:

use strict;
use warnings;
use Encode;

system("chcp 65001 > NUL");

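# Decode the file as UTF-8 while reading.
open my $fh, '<:encoding(UTF-8)', 'test.txt' or die "Cannot open test.txt: $!";
while (<$fh>) {
    print encode('UTF-8', $_); # the console was switched to UTF-8 (65001) above
}
close $fh;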
clamp
  • Thanks for this idea. `chcp` does work for cmd. I'm still looking for a solution for the Cygwin bash. With bash on a Linux machine it works out of the box, but it doesn't on Windows. The second solution changes the output encoding to cp850, but only a few Unicode characters can be mapped to cp850. It works for the umlauts in my example above, but my real script uses many different characters. – André Feb 19 '21 at 18:12
  • If I call `cmd /c chcp 65001` from within bash, it works in bash too. This still looks like an ugly workaround ;-) – André Feb 19 '21 at 18:22
1

It seems you are missing the LANG setting:

$ export LANG=de_DE.UTF-8

$ echo $LANG
de_DE.UTF-8

$ cat test.txt
äöü

$ perl test.pl
äöü

$ file test.txt
test.txt: UTF-8 Unicode text

$ od -c  test.txt
0000000 303 244 303 266 303 274  \n
0000007

$ which perl
/usr/bin/perl
matzeri
  • I tried this already. I get exactly the same output after executing your command - except for `perl test.pl`. It's crazy. – André Feb 19 '21 at 14:27
  • What terminal are you using? And are you sure about perl? – matzeri Feb 19 '21 at 14:29
  • I use the Cygwin bash, but I also tried bash inside "Windows Terminal", "cmd.exe" and "Windows PowerShell". I'm sure about the output of the perl script. The PC itself is only a few days old and set up without any unusual software. Besides this, I'm using Strawberry Perl (strawberryperl.com). – André Feb 19 '21 at 14:35
  • No, I'm used to using Strawberry, because I can easily install all required packages there. I have now tried my script on a Linux machine and it works as expected. This is probably not a Perl problem but a Windows problem. On the other hand, cat test.txt works ... – André Feb 19 '21 at 14:40
1
$ "$( cygpath 'C:\progs\sp5302-x64\perl\bin\perl.exe' )" -M5.010 -e'
   use Win32;
   BEGIN {
      Win32::SetConsoleCP(65001);
      Win32::SetConsoleOutputCP(65001);
   }
   use open ":std", ":encoding(UTF-8)";
   say chr(0x2660);
'
♠

(BEGIN { `chcp 65001` } would also have done the trick.)

ikegami
0

You can explicitly set the encoding of both the input file handle and the output.

open( my $fh, '<:utf8', 'test.txt');
binmode(STDOUT,':utf8');
print <$fh>;
close $fh;
AnFi