Unescaping HTML entities (&#nnnn;) into plain UTF-8

Question

We have HTML source files which contain special characters encoded as &#nnnn; like in the word:

au&#223;ergew&#246;hnlich

We would like to convert them into plain UTF-8:

außergewöhnlich

Is there any small tool to do that?

Looks like duplicate of http://stackoverflow.com/questions/57708/convert-xml-html-entities-into-unicode-string-in-python — uthark, Jun 22 '10 at 16:37
@uthark – That question asks for a solution using Python, which is not necessary for this conversion (and is a bit like killing a mosquito using a cannon). To compare: in my distribution the installed size of Python is 63348 KB; the size of Lynx is ‘only’ 3770 KB. — Marcel Korpel, Jun 22 '10 at 16:42
@Marcel Korpel Where are you getting these bull shit numbers? — Evan Carroll, Jun 22 '10 at 18:34
@Evan – From my distribution's package manager (Arch Linux' Pacman). `pacman -Qi python` gives Python's installed size (among other things), `pacman -Qi lynx` does the same for Lynx, etc. — Marcel Korpel, Jun 22 '10 at 19:08
This is a one-time conversion of HTML source files that contain numeric character entity references to files that contain actual UTF-8 encoded characters, correct? Who cares how big the tool is? Evan's perl or uthark's ascii2uni seem like fine answers. — Stephen P, Jun 22 '10 at 19:24

score 4 · Answer 1 · answered Jun 22 '10 at 17:01

You can do this with Perl and the HTML::Entities module, if you wish.

echo 'au&#223;ergew&#246;hnlich' |
perl -MHTML::Entities -pe 'binmode STDOUT, ":utf8"; HTML::Entities::decode_entities($_)'

Evan Carroll

Again, shooting a mosquito… Perl occupies 45796 KB here. – Marcel Korpel Jun 22 '10 at 17:34
But this works, and `lynx -dump` fails. – Stephen P Jun 22 '10 at 17:48
Perl occupies 45meg? That sounds like a mighty ridiculous claim. The binary is 1.2MB. Running that there is 2.3MB resident. – Evan Carroll Jun 22 '10 at 18:28

score 3 · Accepted Answer · answered Jun 22 '10 at 16:55

I suppose the ascii2uni tool will perform the required conversion.

The tool is only a few hundred kilobytes in size, which is smaller than lynx, mentioned above.

uthark

Zakhar · Answer 3 · 2015-09-29T21:37:29.220

Here is a full shell solution (you did not specify which 'language' should be used).

foo='au&#223;ergew&#246;hnlich'
echo "$foo"

au&#223;ergew&#246;hnlich

eval "$(printf '%s' "$foo" | sed 's/^/printf "/;s/&#0*\([0-9]*\);/\$( [ \1 -lt 128 ] \&\& printf "\\\\$( printf \"%.3o\\201\" \1)" || \$(which printf) \\\\u\$( printf \"%.4x\" \1) )/g;s/$/\\n"/')" | sed "s/$(printf '\201')//g"

außergewöhnlich

Comment: this ALSO works with dash (the standard shell on Ubuntu). We must call the external GNU printf (located with which) in some places because the printf builtin in dash does not know \u for converting to Unicode. GNU printf, in turn, is kind of stupid, as it refuses to work with code points from 0 to 127, which are perfectly legal in UTF-8. Thus we have to make the conversion conditional and use octal escapes for the range 0-127. The last sed is needed in case you convert characters like Line Feed (&#10;) or Tab (&#9;). Command substitution would otherwise drop a trailing newline, so we use a trick: a marker byte (octal 201) is appended so that the substitution keeps these trailing characters, then we remove the marker with the last sed. The character used for that should NOT occur if your input conforms to Unicode, so it should be safe.
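
To illustrate the trailing-character trick on its own, here is a minimal sketch in POSIX sh. It is not the one-liner above: it assumes a plain letter x as the marker (a hypothetical choice for readability, where the answer uses the byte \201) and strips it with parameter expansion instead of sed.

```shell
# Command substitution removes trailing newlines, so both "\n" vanish here.
lost=$(printf 'abc\n\n')

# Appending a marker keeps the newlines inside the substitution...
kept=$(printf 'abc\n\nx')

# ...and the marker is stripped again afterwards
# (the answer above does this step with sed).
restored=${kept%x}

printf 'without marker: %s characters\n' "${#lost}"
printf 'with marker:    %s characters\n' "${#restored}"
```

The first count covers only abc, while the second also includes the two preserved newlines.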
