In Bash (on Ubuntu), is there a command which removes invalid multibyte (non-ASCII) characters?

I've tried perl -pe 's/[^[:print:]]//g', but it also removes all valid non-ASCII characters. I can use sed, awk, or similar utilities if needed.
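For example, with a made-up test string containing both a valid UTF-8 character (é, encoded as \xc3\xa9) and a stray invalid byte (\xff), the valid character is stripped too:

printf 'caf\xc3\xa9 \xff!\n' | perl -pe 's/[^[:print:]]//g'   # -> 'caf !' - the é is gone as well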
The problem is that Perl does not realize that your input is UTF-8; it assumes it's operating on a stream of bytes. You can use the -CI flag to tell it to interpret the input as UTF-8. And, since you will then have multibyte characters in your output, you will also need to tell Perl to use UTF-8 when writing to standard output, which you can do with the -CO flag. So:

perl -CIO -pe 's/[^[:print:]]//g'
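For example (a quick, made-up test; assumes a UTF-8 terminal), a valid non-ASCII character now survives while a control character is stripped:

printf 'caf\xc3\xa9\x01!\n' | perl -CIO -pe 's/[^[:print:]]//g'   # -> 'café!'

(The trailing newline is also non-printable, so it is removed as well.)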
If you want a simpler alternative to Perl, try iconv, as follows:

iconv -c <<<$'Mot\xf6rhead'   # -> 'Motrhead'
Specify the input encoding with -f (e.g., -f UTF8) and the output encoding with -t (e.g., -t UTF8); run iconv -l to see all supported encodings. If you omit -f and -t, both default to the current locale's encoding, which is why the example above works as-is in a UTF-8 locale.

-c simply discards input characters that aren't valid in the input encoding; in the example, \xf6 is the single-byte LATIN1 (ISO8859-1) representation of ö, which is invalid in UTF-8 (where ö is represented as \xc3\xb6).
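For instance (the file names here are just placeholders), to strip everything that isn't valid UTF-8 from a file while keeping valid multibyte characters:

iconv -c -f UTF8 -t UTF8 dirty.txt > clean.txt   # dirty.txt / clean.txt are hypothetical names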
Note (after discovering a comment by the OP): If your output still contains garbled characters, described as "� (question mark) or (box with hex numbers in it)", the implication is indeed that the cleaned-up string contains valid UTF-8 characters that the font being used simply doesn't support.