
In Bash (on Ubuntu), is there a command which removes invalid multibyte (non-ASCII) characters?

I've tried perl -pe 's/[^[:print:]]//g' but it also removes all valid non-ASCII characters.

I can use sed, awk or similar utilities if needed.

  • What is a multibyte character? – Avinash Raj Jun 22 '14 at 17:43
  • @AvinashRaj I meant non-ASCII, I'll edit the post –  Jun 22 '14 at 17:44
  • This may be relevant: http://stackoverflow.com/questions/115210/utf-8-validation – ooga Jun 22 '14 at 17:54
  • What do you consider invalid? For example, is an unassigned codepoint valid or invalid? – rici Jun 22 '14 at 17:56
  • So how's `iconv` working for you? – ooga Jun 22 '14 at 18:07
  • @professorfish: In that case, you'll need to use a unicode-aware regex engine which understands unicode general category codes. You'll want to remove general categories Cn (unassigned), Cs (surrogate), and probably Co (private use). `Cn` is relative to a particular Unicode version; in future versions Cn codepoints may be assigned (aside from the 66 non-characters). – rici Jun 22 '14 at 18:24
  • @ooga I tried `iconv -c`, it removes most of the invalid characters but some still display as � (question mark) or ߻ (box with hex numbers in it). Is it just because they aren't in my font? –  Jun 22 '14 at 18:26

2 Answers

The problem is that Perl does not realize that your input is UTF-8; it assumes it is operating on a stream of bytes. You can use the -CI flag to tell it to interpret the input as UTF-8. And since the output will then contain multibyte characters, you also need to tell Perl to write UTF-8 to standard output, which you can do with the -CO flag. So:

perl -CIO -pe 's/[^[:print:]]//g'
ruakh
If you want a simpler alternative to Perl, try iconv as follows:

iconv -c <<<$'Mot\xf6rhead'  # -> 'Motrhead'
  • Both the input and output encodings default to UTF-8, but can be specified explicitly: the input encoding with -f (e.g., -f UTF-8), the output encoding with -t (e.g., -t UTF-8); run iconv -l to see all supported encodings.
  • -c simply discards input characters that aren't valid in the input encoding; in the example, \xf6 is the single-byte LATIN1 (ISO-8859-1) representation of ö, which is invalid in UTF-8 (where ö is represented as \xc3\xb6).
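Spelling out the encodings makes the behavior easy to verify (a brief sketch; the sample string is illustrative):

```shell
# \xf6 is Latin-1 ö, which is not a valid UTF-8 byte sequence, so -c drops it:
printf 'Mot\xf6rhead\n' | iconv -f UTF-8 -t UTF-8 -c       # -> Motrhead

# The same byte IS valid Latin-1, so declaring the input encoding correctly
# converts it instead of discarding it:
printf 'Mot\xf6rhead\n' | iconv -f ISO-8859-1 -t UTF-8     # -> Motörhead
```

In other words, whether a byte is "invalid" depends entirely on the declared input encoding.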

Note (after discovering a comment by the OP): If your output still contains garbled characters:

"� (question mark) or ߻ (box with hex numbers in it)"

the implication is that the cleaned-up string contains valid UTF-8 characters that the font being used simply doesn't support.
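You can confirm that hypothesis by running the cleaned output through iconv again, with the same encoding on both sides, as a pure validity check (a hedged sketch; the sample pipeline below stands in for your actual cleaned text):

```shell
# Clean a sample string, then re-validate it: exit status 0 from the second
# iconv means every remaining byte is valid UTF-8, so any "boxes" or
# replacement glyphs you still see are characters missing from the font.
cleaned=$(printf 'Mot\xf6rhead' | iconv -c -f UTF-8 -t UTF-8)
if printf '%s' "$cleaned" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
  echo "valid UTF-8 - remaining boxes are a font issue"
else
  echo "invalid bytes remain"
fi
```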

mklement0