We have HTML source files which contain special characters encoded as &#nnnn;
like in the word:
au&#223;ergew&#246;hnlich
We would like to convert them into plain UTF-8:
außergewöhnlich
Is there any small tool to do that?
You can do this with Perl and the HTML::Entities module:
echo 'au&#223;ergew&#246;hnlich' |
perl -MHTML::Entities -pe 'binmode STDOUT, ":utf8"; HTML::Entities::decode_entities($_)'
Here is a pure shell solution (the question does not say which language to use).
foo='au&#223;ergew&#246;hnlich'
echo "$foo"
au&#223;ergew&#246;hnlich
eval "$(printf '%s' "$foo" | sed 's/^/printf "/;s/&#0*\([0-9]*\);/\$( [ \1 -lt 128 ] \&\& printf "\\\\$( printf \"%.3o\\201\" \1)" || \$(which printf) \\\\u\$( printf \"%.4x\" \1) )/g;s/$/\\n"/')" | sed "s/$(printf '\201')//g"
außergewöhnlich
Comment: this also works with dash (the default /bin/sh on Ubuntu). The external GNU printf is needed in some places because dash's builtin printf does not understand \u for converting a code point to Unicode. GNU printf, in turn, refuses \u for code points 0 to 127, even though they are perfectly legal in UTF-8, so the conversion has to be conditional and use octal escapes for the range 0-127. The last sed is there in case you need to convert characters like Line Feed (&#10;) or Tab (&#9;): command substitution strips trailing newlines, so a marker byte (octal 201) is appended after each such character to preserve it, and the last sed removes the marker again. The character used for that should not occur if your input conforms to Unicode, so it should be safe.
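The trailing-newline behaviour behind that trick can be seen in isolation with a small sketch like the one below. It is only an illustration: the marker here is a plain "x" and it is removed with a parameter expansion instead of sed, and the variable names are made up for the demo.

```shell
#!/bin/sh
# Command substitution drops every trailing newline from the captured output.
stripped=$(printf 'line\n\n')

# Printing a marker after the data protects the newlines; the marker is then
# removed again. "x" is an arbitrary choice for this demo.
kept=$(printf 'line\n\n'; printf 'x')
kept=${kept%x}

echo "${#stripped}"   # 4: just "line"
echo "${#kept}"       # 6: "line" plus the two newlines
```

The one-liner above uses the same idea, with the byte \201 as the marker and a final sed to strip it.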