How to read text file contents without loss of characters in Perl

Question

I have the following text in a text file (.txt):

Feste begründen die Identität einer Gemeinschaft und ihr Selbstverständnis nach innen. Eng damit verbunden sind Emotionen, die zunächst im Zusammenhang mit einer gefühlten Zugehörigkeit zu einer Fest-Gemeinschaft zu verstehen sind. Mit jedem Fest verbindet sich aber auch eine emotionale Überschreitung des Alltags: Der bestimmende festliche Eindruck – die feierliche Gestimmtheit – ist der einer erhöhten Bedeutungshaftigkeit des Lebens, durch die sich das Festliche aus dem Lauf des Alltagslebens hervorhebt und dessen Wirkmächtigkeit zuvörderst anhand der Analyse des bürgerlichen Geburtstages sinnfällig demonstriert werden soll.

When I read this text from the .txt file, I get the text shown below:

Feste begründen die Identität einer Gemeinschaft und ihr Selbstverständnis nach innen. Eng damit verbunden sind Emotionen, die zunächst im Zusammenhang mit einer gefühlten Zugehörigkeit zu einer Fest-Gemeinschaft zu verstehen sind. Mit jedem Fest verbindet sich aber auch eine emotionale Überschreitung des Alltags: Der bestimmende festliche Eindruck die feierliche Gestimmtheit ist der einer erhöhten Bedeutungshaftigkeit des Lebens, durch die sich das Festliche aus dem Lauf des Alltagslebens hervorhebt und dessen Wirkmächtigkeit zuvörderst anhand der Analyse des bürgerlichen Geburtstages sinnfällig demonstriert werden soll.

As you can see, the en dash is missing from the text above. I want exactly the text that is in the .txt file. I also tried UTF-8, but the en dash is still missing.

I am looking for your ideas to solve this in Perl.

Not enough information for a meaningful answer. What is the encoding of your input file? How are you reading data from the input file? How are you decoding the data you read from your input file? What encoding do you want in your output file? How are you writing data to your output file? How are you encoding the data you write to your output file? Basically, we need to see far more of your code. – Dave Cross, Apr 21 '14 at 09:32

score 0 · Accepted Answer · edited May 23 '17 at 12:22


Try to begin your script like this:

#!/usr/bin/perl -CS

use open IO => ':utf8';

Then open, read, and print as usual. The open pragma tells Perl to use UTF-8 for input and output on every file handle opened in its scope, and the -CS option turns on Unicode support for STDIN, STDOUT and STDERR.

Because -CS appears on the #! line, Perl must also see it on the command line, so you need to run your script in one of the following ways:

add execute permission to it, and use ./script.pl to run it, or
use perl -CS /path/to/script.pl
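
As an alternative to the -CS switch (a variation, not what the answer above uses), the open pragma's :std subpragma applies the same layer to STDIN, STDOUT and STDERR, so the script behaves the same however it is started. A minimal sketch, assuming the input really is UTF-8; the sub name copy_to_stdout is made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Apply a UTF-8 layer to handles opened in this file and, through :std,
# to STDIN, STDOUT and STDERR too. No -C switch is involved.
use open qw(:std :encoding(UTF-8));

# Copy a UTF-8 text file to STDOUT line by line.
# (copy_to_stdout is a hypothetical helper name.)
sub copy_to_stdout {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    print while <$fh>;
    close $fh;
    return;
}

copy_to_stdout($ARGV[0]) if @ARGV;
```

Run it as perl script.pl input.txt; no special switch is needed on the command line.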

Reference:
perlrun
open
use utf8 gives me 'Wide character in print'

This script should be able to create an exact copy (checked with diff) of this file. It prints the value of ${^UNICODE} to STDERR; while the script is running with -CS in effect, that value should be 7.

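#!/usr/bin/perl -CS

use strict;
use warnings;

# Treat every file handle opened in this script as UTF-8.
use open IO => ':utf8';

# With -CS in effect this prints 7 (STDIN + STDOUT + STDERR).
print STDERR "\${^UNICODE} = ${^UNICODE}\n";

@ARGV or die "Usage: $0 FILE\n";
open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";

# Copy the file to STDOUT line by line.
while (<$fh>) {
    print;
}

close $fh;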
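
The value 7 is a bit mask: perlrun documents 1 for STDIN, 2 for STDOUT and 4 for STDERR, and -CS is shorthand for all three. The following sketch decodes such a value; the helper name utf8_streams is an assumption made for illustration, not part of Perl:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Bit values documented in perlrun for the -C switch.
my @STD_FLAGS = ([STDIN => 1], [STDOUT => 2], [STDERR => 4]);

# Return the names of the standard streams flagged as UTF-8 in $value.
# (utf8_streams is a hypothetical helper name.)
sub utf8_streams {
    my ($value) = @_;
    return map { $_->[0] } grep { $value & $_->[1] } @STD_FLAGS;
}

print "-CS sets: ", join(', ', utf8_streams(7)), "\n";
print "this run: ", join(', ', utf8_streams(${^UNICODE})) || 'none', "\n";
```

The first line always prints "-CS sets: STDIN, STDOUT, STDERR"; the second depends on how the script was started.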

answered Apr 21 '14 at 07:58

Lee Duhem


Thanks Mr. Lee, I used the line above but I still get the same output, without the en dash. I also tried #binmode STDOUT, ":utf8"; but that did not work either. Thanks for your reply; please comment if you have any other idea. – user3354853 Apr 21 '14 at 08:17
@user3354853 Please make sure your input file is using UTF-8 encoding. – Lee Duhem Apr 21 '14 at 08:25
Sure Mr. Lee, I will comment back after I have tried this. Thanks a lot. – user3354853 Apr 21 '14 at 08:34
Mr. Lee, I tried with the input file saved as UTF-8 but I still get the same error! – user3354853 Apr 21 '14 at 09:13
"that pragma will instruct Perl to use UTF8 encoding for all input and output" - actually no, that pragma will only affect output. To affect both input and output, you need `use open IO => ':utf8'`. But that assumes that the input and output are both UTF-8. We don't know that from the question. – Dave Cross Apr 21 '14 at 09:35
@user3354853 I copied and pasted your first text into a plain text file (created by Vim 7.3) using UTF-8 encoding. After adding that `open` pragma, I `open` the file, read it line by line, and `print` it, and I get an exact copy of the input file (Perl 5.14.2). Could you offer a download link for your input file so we can test our answers with it? – Lee Duhem Apr 21 '14 at 09:36
@user3354853 Please recheck my updated answer, there is a mistake in the previous version. – Lee Duhem Apr 21 '14 at 09:39
Thanks Mr. Lee, I have uploaded my txt file; please take a look: https://mega.co.nz/#!3k4SASIB!x2ZSd62-HowlfCiJjSxg_uqweR2OC6xMmjsoVdtFqvo – user3354853 Apr 21 '14 at 09:54
@user3354853 Looks like using only the `open` pragma is not enough; you also need to add the `-CS` option to the perl interpreter. Please check out my updated answer. – Lee Duhem Apr 21 '14 at 10:23

scozy · Answer 2 · 2014-04-21T08:29:26.977

The fact that Perl handles your umlauts but not your dashes suggests that the file uses windows-1252 encoding. Perl is probably assuming that the file is in latin-1 (ISO-8859-1), an encoding that has no printable characters at code points 0x80 to 0x9F (they are control characters there). The en dash is 0x96 in windows-1252, which would explain why it does not show up in the output.
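
The difference can be checked with the core Encode module. This is only an illustration of the hypothesis above, not a test of the asker's actual file; cp1252 is the name Encode uses for windows-1252:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Decode the same single byte, 0x96, under both encodings.
my $as_cp1252 = decode('cp1252',     "\x96");
my $as_latin1 = decode('iso-8859-1', "\x96");

printf "cp1252:     U+%04X\n", ord $as_cp1252;   # U+2013, EN DASH
printf "iso-8859-1: U+%04X\n", ord $as_latin1;   # U+0096, a control character
```

Read as latin-1, the byte becomes an invisible control character, which is consistent with the dash seeming to vanish while the umlauts (identical in both encodings) survive.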

Try telling Perl to use windows-1252 both for files and for the terminal, with the open pragma:

use open qw( :encoding(windows-1252) :std );
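
If the terminal or output file expects UTF-8 rather than windows-1252, the layers can instead be set per handle. The sketch below assumes the input is windows-1252 (a guess, not confirmed in the question) and uses a temporary sample file as a stand-in for the real .txt file:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Build a small sample file of raw windows-1252 bytes (a hypothetical
# stand-in for the asker's file): 0xFC is "ü", 0x96 is the en dash.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;                                # write the bytes untouched
print {$tmp} "begr\xFCnden \x96 Fest\n";
close $tmp or die "Cannot close $path: $!";

# Decode on input: the layer turns windows-1252 bytes into characters.
open my $in, '<:encoding(cp1252)', $path or die "Cannot open $path: $!";
my $line = <$in>;
close $in;

# Encode on output: emit those characters as UTF-8.
binmode STDOUT, ':encoding(UTF-8)';
print $line;
```

Setting the layer in the open call keeps the encoding decision next to the file it applies to, which helps when input and output use different encodings.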
