Replacing HTML ascii codes via a bash script?

Question

I need a way to replace HTML ASCII codes like &#33; with their correct character in bash.

Is there a utility I could run my output through to do this, or something along those lines?

I guess it depends on how complicated the files are and how often you need to do it; on the small scale, I'd just open the file with a browser and copy/paste it out. — Carl Norum, Feb 14 '10 at 19:14
Carl: I can't open it in a browser because this is a background script designed to be used by GeekTool. Dennis: No, I'm stripping out just the description element from an RSS feed. — vilhalmer, Feb 14 '10 at 19:51

score 6 · Accepted Answer · answered Feb 14 '10 at 19:58

$ echo '&#33;' | recode html/..
!
$ echo '&lt;&infin;&gt;' | recode html/..
<∞>

ephemient

ephemient, this is awesome! The only problem is that it isn't included with OS X, so I'll have to find a way to distribute it. – vilhalmer Feb 14 '10 at 20:25
An alternative is to pipe through a web browser -- such as `echo '&#33;' | w3m -dump -T text/html` – user1686 Feb 14 '10 at 21:00
@SphereCat1 http://recode.darwinports.com/ http://pdb.finkproject.org/pdb/package.php/recode Don't forget to distribute GNU recode consistent with its license, GPL. @grawity Clever, but I don't think OS X comes with w3m or lynx either ;-) – ephemient Feb 14 '10 at 21:04

score 1 · Answer 2 · answered Feb 14 '10 at 19:07

I don't know of an easy way, but here is what I would try.

You might be able to script a browser into reading the file and then saving it as text. If lynx supports HTML character entities, it might be worth looking into. If that doesn't work out...

The general solution to something like this is sed, but it takes a "higher-order" edit: you start with an entity table and then edit that table, in several steps, into an edit script of its own. Something like:

. . .
s/&amp;Dagger;/&Dagger;/g<br />
s/&amp;#8221;/&#8221;/g<br />
. . .

Then wrap this in HTML, load it into a browser, and save it as text in the character set you are targeting. If you get it to produce lines like:

s/&lt;/</g

then you win. A bash script that calls sed or ex can be driven by the substitute commands in the file.
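A minimal sketch of that last step, assuming the generated substitutions have already been saved to a file. The four entities, the temporary file, and the `decode_entities` name are illustrative only, not a complete table. The `&amp;` rule is deliberately last so that something like `&amp;lt;` decodes to `&lt;` rather than all the way to `<`.

```bash
#!/bin/sh
# Write a small sed script to a temporary file (a stand-in for the generated table).
entity_script=$(mktemp)
trap 'rm -f "$entity_script"' EXIT

cat > "$entity_script" <<'EOF'
s/&lt;/</g
s/&gt;/>/g
s/&#33;/!/g
s/&amp;/\&/g
EOF

# Filter stdin through the script; sed -f reads its commands from a file.
decode_entities() {
    sed -f "$entity_script"
}

printf '%s\n' 'AT&amp;T says &lt;hi&gt;&#33;' | decode_entities
# prints: AT&T says <hi>!
```

In the replacement part of an `s` command a bare `&` means "the whole match", which is why the last rule writes `\&` to get a literal ampersand.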

Alright, that's pretty much what I'm already doing, just manually adding each one to the script. I didn't know I could run sed with a scripting file, though, that's a useful bit of info! Thanks! — vilhalmer, Feb 14 '10 at 19:53
If you use this solution, make sure to put `s/&amp;\|&#38;\|&#x26;/\&/g` at end of the script; otherwise, if it's before another entry (say `s/&#33;/!/g`), then `&amp;#33;` would get improperly translated to `!` instead of `&#33;`. — ephemient, Feb 14 '10 at 20:04

score 1 · Answer 3 · answered Sep 29 '15 at 21:58

Here is my solution with the standard Linux toolbox.

$ foo="This is a line feed&#010;And e acute:&#233; with a grinning face &#128512;."
$ echo "$foo"
This is a line feed&#010;And e acute:&#233; with a grinning face &#128512;.
$ eval "$(printf '%s' "$foo" | sed 's/^/printf "/;s/&#0*\([0-9]*\);/\$( [ \1 -lt 128 ] \&\& printf "\\\\$( printf \"%.3o\\201\" \1)" || \$(which printf) \\\\U\$( printf \"%.8x\" \1) )/g;s/$/\\n"/')" | sed "s/$(printf '\201')//g"
This is a line feed
And e acute:é with a grinning face 😀.

As you can see, it works for all kinds of numeric escapes: the line feed, e acute (é), which takes 2 bytes in UTF-8, and even the newer emoticons from the supplementary planes, which take 4 bytes in UTF-8.

This command also works with dash, a trimmed-down shell (the default /bin/sh on Ubuntu), and it is compatible with bash and with ash-style shells such as the one used on Synology devices.

If you don't mind sticking with bash and dropping that compatibility, you can make it much simpler.

The pieces used should be available on any decent Linux box (or OS X?):

- which
- printf (GNU and builtin)
- GNU sed
- eval (shell builtin)

The bash-only version needs neither which nor GNU printf.
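The bash-only version is not shown above, so here is one possible sketch of it, not the original author's code. It assumes bash 4.2 or later (for `\U` escapes in the printf builtin) and a UTF-8 locale for anything beyond ASCII. The function name `decode_numeric_entities` is made up for illustration, and only decimal `&#NNN;` references are handled.

```bash
#!/usr/bin/env bash
# Decode decimal numeric character references using only bash builtins.
decode_numeric_entities() {
    local s=$1 out='' match code char
    local re='&#([0-9]+);'
    # Each pass handles the leftmost remaining reference in $s.
    while [[ $s =~ $re ]]; do
        match=${BASH_REMATCH[0]}
        # Force base 10 so a reference like &#010; is not parsed as octal.
        code=$((10#${BASH_REMATCH[1]}))
        # Build a \UXXXXXXXX escape and let %b expand it into the character.
        printf -v char '%b' "$(printf '\\U%08x' "$code")"
        out+=${s%%"$match"*}$char   # text before the reference, then the character
        s=${s#*"$match"}            # keep scanning after the reference
    done
    printf '%s\n' "$out$s"
}

decode_numeric_entities 'Hello&#33; 1 &#60; 2'
# prints: Hello! 1 < 2
```

Using `printf -v` rather than a command substitution for the final character matters for `&#010;`: a command substitution would strip the newline that the reference decodes to.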
