7

How can I find extended ASCII characters in a file using Perl? Can anyone share a script?


brian d foy
User1611

6 Answers

10

Since the extended ASCII characters have values of 128 and higher, you can call ord on individual characters and handle those with a value >= 128. The following code reads the files named on the command line (or stdin if none are given) and prints only the extended ASCII characters:

while (<>) {
  while (/(.)/g) {
    print($1) if (ord($1) >= 128);
  }
}

Alternatively, unpack together with chr will also work. Example:

while (<>) {
  foreach (unpack("C*", $_)) {
    print(chr($_)) if ($_ >= 128);
  }
}

(I'm sure some Perl guru can condense both of these to two one-liners...)


To print the line numbers instead, you can use the following. It prints the line number once per matching character, so a line with several extended characters is reported several times, and it will behave oddly on Unicode input:

while (<>) {
  while (/(.)/g) {
    print($. . "\n") if (ord($1) >= 128);
  }
}

(Thanks Yaakov Belch for the $. tip.)
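If one report per line is preferred over one per character, a possible variation is to count the high bytes on each line with tr///. This is only a sketch: the helper name high_byte_lines and the sample text are made up for illustration, and the input is assumed to be raw single-byte data rather than decoded Unicode.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [line_number, count] pairs for every line
# that contains at least one byte in the range 0x80-0xFF.
sub high_byte_lines {
    my ($fh) = @_;
    my @hits;
    while (my $line = <$fh>) {
        # tr/// with an empty replacement list only counts the matches.
        my $count = ($line =~ tr/\x80-\xFF//);
        push @hits, [$., $count] if $count;
    }
    return @hits;
}

# Demo on an in-memory file; the sample text is invented.
my $sample = "plain\ncaf\xE9 na\xEFve\nalso plain\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
printf "line %d: %d extended character(s)\n", @$_ for high_byte_lines($fh);
close $fh;
```

For the sample above, only line 2 is reported, with a count of 2.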

Stephan202
  • This is a very slow and inefficient approach; see Dave Sherohman's solution: http://stackoverflow.com/questions/881931/how-to-print-numbers-of-line-containing-extended-ascii-characters-in-perl/882113#882113 It is far faster and simpler. – Hynek -Pichi- Vychodil May 19 '09 at 12:11
  • This answer was posted before Dave's. I have seen Dave's approach, and it is to be preferred in most instances. This just shows that I'm a Perl novice. I choose not to delete this answer because the last part appears to do exactly what the questioner wants. Also see http://stackoverflow.com/questions/882122/reading-a-file-char-by-char-and-checking-for-extented-ascii-char – Stephan202 May 19 '09 at 12:24
  • ...ah, that page has been deleted. Suffice it to say, the question stated that the line number should be printed for *each* extended ASCII character. This is what my solution does. – Stephan202 May 19 '09 at 12:26
8

The first printable ASCII character is space (32). The last printable ASCII character is ~ (126). So I'd probably use

while (<>) {
  chomp;  # the trailing newline is itself outside ' '..'~' and would match
  print "$.\n" if /[^ -~]/;
}

although it will, admittedly, also report lines containing control characters (tabs included) as well as extended ASCII.

Edit: Changed to print the line number rather than the line itself.
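To make the control-character caveat concrete, here is a small sketch comparing the printable-range test with a test for bytes above 0x7F only. The helper names has_non_printable and has_non_ascii and the sample lines are assumptions made up for this illustration, not part of the answer.

```perl
use strict;
use warnings;

# Hypothetical helpers, named here only for illustration.
sub has_non_printable { return $_[0] =~ /[^ -~]/       ? 1 : 0 }
sub has_non_ascii     { return $_[0] =~ /[^\x00-\x7F]/ ? 1 : 0 }

# Made-up samples: plain text, a tab, and a Latin-1 e-acute (0xE9).
my @samples = (
    [ 'plain'  => "plain text" ],
    [ 'tab'    => "tab\there"  ],
    [ 'accent' => "caf\xE9"    ],
);

for my $sample (@samples) {
    my ($label, $line) = @$sample;
    printf "%-6s outside ' '..'~': %d, outside 0x00..0x7F: %d\n",
        $label, has_non_printable($line), has_non_ascii($line);
}
```

The tab line trips the first test but not the second, while the accented line trips both, so the second class is the one to use if control characters should be ignored.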

Dave Sherohman
  • It's easy to print the line number instead of the line: while(<>) { print "$.\n" if /[^ -~]/;} This should solve the stated problem. – Yaakov Belch May 19 '09 at 11:23
  • Whoops! I was just reading the question itself and missed that the title specified that he wanted the line number. Thanks for the catch. – Dave Sherohman May 19 '09 at 11:27
5

One-liner:

perl -nE'say$.if/[\xE0-\xFF]/'

For older Perl versions (before 5.10, which introduced -E and say):

perl -lne'print$.if/[\xE0-\xFF]/'
Hynek -Pichi- Vychodil
2

Hynek -Pichi- Vychodil's answer:

perl -nE'say$.if/[\xE0-\xFF]/'

only tests a limited part of the non-ASCII range (0xE0 to 0xFF); it should presumably be

perl -nE'say$.if/[\x80-\xFF]/'

instead.
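As a quick check of that claim, the following sketch counts how many of the 128 byte values from 0x80 to 0xFF each character class matches. The variable names are made up for the illustration.

```perl
use strict;
use warnings;

# Every single-byte character from 0x80 to 0xFF.
my @high = map { chr($_) } 0x80 .. 0xFF;

# grep in scalar context returns the number of matching elements.
my $narrow = grep { /[\xE0-\xFF]/ } @high;
my $full   = grep { /[\x80-\xFF]/ } @high;

print "[\\xE0-\\xFF] matches $narrow of ", scalar(@high), " high bytes\n";
print "[\\x80-\\xFF] matches $full of ", scalar(@high), " high bytes\n";
```

The narrower class covers only 32 of the 128 values, so bytes such as 0xA9 would go unreported.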

2

A crucial question is whether the

use bytes;

pragma should be in effect, that is, whether the input is to be treated as raw bytes or as decoded characters. The poster should decide that. For picking out characters with codes greater than 127, the following will suffice:

print grep 127 < ord, split // while <>;

or

print grep /[^[:ascii:]]/, split // while <>;
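Why the bytes-versus-characters decision matters can be sketched with the core Encode module. The example assumes UTF-8 input, which the question does not state, and the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 encoding of e-acute (U+00E9) is the two bytes 0xC3 0xA9.
my $bytes = "\xC3\xA9";
my $chars = decode('UTF-8', $bytes);

# Byte view: two elements, both >= 128.
my @byte_codes = map { ord($_) } split //, $bytes;
# Character view: a single element holding the code point.
my @char_codes = map { ord($_) } split //, $chars;

printf "bytes: %s\n", join ' ', @byte_codes;   # 195 169
printf "chars: %s\n", join ' ', @char_codes;   # 233
```

Read as bytes, the one accented letter shows up as two extended characters; decoded, it is one character. A per-character report therefore differs depending on which view is chosen.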
Sinan Ünür
1

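What about grep? Assuming GNU grep with -P (Perl-compatible regex) support, and with the pattern quoted so the shell leaves it alone:

LC_ALL=C grep -P '[\x00-\x1F\x7F-\xFF]+' *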