4

We have HTML source files which contain special characters encoded as &#nnnn; like in the word:

au&#223;ergew&#246;hnlich

We would like to convert them into plain UTF-8:

außergewöhnlich

Is there any small tool to do that?

Marcel Korpel
dagnelies
  • Looks like duplicate of http://stackoverflow.com/questions/57708/convert-xml-html-entities-into-unicode-string-in-python – uthark Jun 22 '10 at 16:37
  • @uthark – That question asks for a solution using Python, which is not necessary for this conversion (and is a bit like killing a mosquito using a cannon). To compare: in my distribution the installed size of Python is 63348 KB; the size of Lynx is ‘only’ 3770 KB. – Marcel Korpel Jun 22 '10 at 16:42
  • @Marcel Korpel Where are you getting these bull shit numbers? – Evan Carroll Jun 22 '10 at 18:34
  • @Evan – From my distribution's package manager (Arch Linux' Pacman). `pacman -Qi python` gives Python's installed size (among other things), `pacman -Qi lynx` does the same for Lynx, etc. – Marcel Korpel Jun 22 '10 at 19:08
  • fascinating, that is totally useless. – Evan Carroll Jun 22 '10 at 19:13
  • This is a one-time conversion of HTML source files that contain numeric character entity references to files that contain actual UTF-8 encoded characters, correct? Who cares how big the tool is? Evan's perl or uthark's ascii2uni seem like fine answers. – Stephen P Jun 22 '10 at 19:24

3 Answers

4

You can do this with perl, and HTML::Entities if you wish.

echo 'au&#223;ergew&#246;hnlich' |
perl -MHTML::Entities -pe'binmode STDOUT, ":utf8"; HTML::Entities::decode_entities($_)'
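
For comparison, here is a minimal sketch of the same library-based decoding in Python, using the standard library's `html.unescape`. The function names `decode_entities` and `convert_file` are hypothetical, and the sketch assumes the source files can be read as UTF-8 (which holds for pure ASCII input). Note that `html.unescape` decodes named references as well, so `&lt;` and `&amp;` in the markup would also be turned into literal `<` and `&`.

```python
import html


def decode_entities(text: str) -> str:
    """Decode numeric and named character references into Unicode characters."""
    return html.unescape(text)


def convert_file(src_path: str, dst_path: str) -> None:
    """Read a file, decode its character references, and write it out as UTF-8.

    Assumption: the input is ASCII or UTF-8 encoded.
    """
    with open(src_path, encoding="utf-8") as src:
        data = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(decode_entities(data))
```

Calling `decode_entities('au&#223;ergew&#246;hnlich')` returns the decoded word; `convert_file` applies the same step to a whole file.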
Evan Carroll
3

I suppose the ascii2uni tool will perform the required conversion.

The tool is a few hundred kilobytes in size, which is smaller than Lynx, mentioned above.

uthark
-1

Here is a full shell solution (apparently you don't specify the 'language' to be used).

foo='au&#223;ergew&#246;hnlich'
echo "$foo"

au&#223;ergew&#246;hnlich

eval "$(printf '%s' "$foo" | sed 's/^/printf "/;s/&#0*\([0-9]*\);/\$( [ \1 -lt 128 ] \&\& printf "\\\\$( printf \"%.3o\\201\" \1)" || \$(which printf) \\\\u\$( printf \"%.4x\" \1) )/g;s/$/\\n"/')" | sed "s/$(printf '\201')//g"

außergewöhnlich

Comment: this ALSO works with dash (the standard shell on Ubuntu). We must use GNU printf in some places because the printf builtin in dash does not know \u for converting to Unicode. GNU printf, in turn, is kind of stupid, as it refuses to work with codepoints from 0 to 127, which are perfectly legal in UTF-8. Thus we have to make it conditional and use octal for the range 0-127. The last sed is needed in case you have to convert characters like Line Feed (&#10;) or Tab (&#9;). We use a trick so that the command substitution keeps these trailing characters, then we remove the "trick" with the last sed. The character used for that should NOT happen if your input conforms to Unicode, so it should be safe.
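
The conversion described above (take the decimal number out of each &#nnnn; reference and emit the character with that codepoint) can also be sketched without any shell tricks. The following Python sketch is an illustration, not the method used in this answer: the name `decode_numeric_refs` is hypothetical, it handles decimal references only, and it leaves named entities such as `&amp;` untouched. Leaving out-of-range and surrogate codepoints unchanged is an assumption made here so that the result can always be written as UTF-8.

```python
import re

# Decimal numeric character references such as &#223; or &#0010;
NUMERIC_REF = re.compile(r"&#([0-9]+);")


def decode_numeric_refs(text: str) -> str:
    """Replace decimal numeric character references with the characters they name."""

    def replace(match: re.Match) -> str:
        codepoint = int(match.group(1))  # leading zeros are accepted
        # Keep references that cannot be encoded as UTF-8 exactly as written.
        if codepoint > 0x10FFFF or 0xD800 <= codepoint <= 0xDFFF:
            return match.group(0)
        return chr(codepoint)

    return NUMERIC_REF.sub(replace, text)
```

Because the replacement is done on the whole string, control characters such as Line Feed and Tab need no special handling.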

Zakhar