
In Bash (on Ubuntu), is there a command which removes invalid multibyte (non-ASCII) characters?

I've tried perl -pe 's/[^[:print:]]//g' but it also removes all valid non-ASCII characters.

I can use sed, awk or similar utilities if needed.

  • What is a multibyte character? – Avinash Raj Jun 22 '14 at 17:43
  • @AvinashRaj I meant non-ASCII, I'll edit the post –  Jun 22 '14 at 17:44
  • This may be relevant: http://stackoverflow.com/questions/115210/utf-8-validation – ooga Jun 22 '14 at 17:54
  • What do you consider invalid? For example, is an unassigned codepoint valid or invalid? – rici Jun 22 '14 at 17:56
  • So how's `iconv` working for you? – ooga Jun 22 '14 at 18:07
  • @professorfish: In that case, you'll need to use a unicode-aware regex engine which understands unicode general category codes. You'll want to remove general categories Cn (unassigned), Cs (surrogate), and probably Co (private use). `Cn` is relative to a particular Unicode version; in future versions Cn codepoints may be assigned (aside from the 66 non-characters). – rici Jun 22 '14 at 18:24
  • @ooga I tried `iconv -c`, it removes most of the invalid characters but some still display as � (question mark) or ߻ (box with hex numbers in it). Is it just because they aren't in my font? –  Jun 22 '14 at 18:26

2 Answers

The problem is that Perl does not realize that your input is UTF-8; it assumes it is operating on a stream of bytes. You can use the -CI flag to tell it to interpret the input as UTF-8. And since the output will then contain multibyte characters, you also need to tell Perl to write UTF-8 to standard output, which you can do with the -CO flag. So:

perl -CIO -pe 's/[^[:print:]]//g'
ruakh
If you want a simpler alternative to Perl, try iconv as follows:

iconv -c <<<$'Mot\xf6rhead'  # -> 'Motrhead'
  • Both the input and output encodings default to UTF-8, but can be specified explicitly: the input encoding with -f (e.g., -f UTF-8), the output encoding with -t (e.g., -t UTF-8); run iconv -l to see all supported encodings.
  • -c simply discards input characters that aren't valid in the input encoding; in the example, \xf6 is the single-byte LATIN1 (ISO-8859-1) representation of ö, which is invalid in UTF-8 (where ö is represented as \xc3\xb6).
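Spelling out the encodings makes the behavior easy to verify (a brief sketch; the sample string is illustrative):

```shell
# \xf6 is Latin-1 ö, which is not a valid UTF-8 byte sequence, so -c drops it:
printf 'Mot\xf6rhead\n' | iconv -f UTF-8 -t UTF-8 -c       # -> Motrhead

# The same byte IS valid Latin-1, so declaring the input encoding correctly
# converts it instead of discarding it:
printf 'Mot\xf6rhead\n' | iconv -f ISO-8859-1 -t UTF-8     # -> Motörhead
```

In other words, whether a byte is "invalid" depends entirely on the declared input encoding.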

Note (after discovering a comment by the OP): If your output still contains garbled characters:

"� (question mark) or ߻ (box with hex numbers in it)"

the implication is that the cleaned-up string contains valid UTF-8 characters that the font being used simply doesn't support.
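You can confirm that hypothesis by running the cleaned output through iconv again, with the same encoding on both sides, as a pure validity check (a hedged sketch; the sample pipeline below stands in for your actual cleaned text):

```shell
# Clean a sample string, then re-validate it: exit status 0 from the second
# iconv means every remaining byte is valid UTF-8, so any "boxes" or
# replacement glyphs you still see are characters missing from the font.
cleaned=$(printf 'Mot\xf6rhead' | iconv -c -f UTF-8 -t UTF-8)
if printf '%s' "$cleaned" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
  echo "valid UTF-8 - remaining boxes are a font issue"
else
  echo "invalid bytes remain"
fi
```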

mklement0