Short way to escape HTML in Bash?

Question

The box has no Ruby/Python/Perl etc.

Only bash, sed, and awk.

One way is to replace characters using a map, but that becomes tedious.

Perhaps there is some built-in functionality I'm not aware of?

score 74 · Accepted Answer · answered Oct 13 '12 at 13:49

Escaping HTML really just involves replacing three characters: <, >, and &. For extra points, you can also replace " and '. So, it's not a long sed script:

sed 's/&/\&amp;/g; s/</\&lt;/g; s/>/\&gt;/g; s/"/\&quot;/g; s/'"'"'/\&#39;/g'
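
As a minimal sketch (the function name `html_escape` is hypothetical, not part of the answer), the same script can be wrapped in a function that filters stdin. The `&` substitution has to run first; otherwise the ampersands introduced by the later replacements would themselves be escaped.

```shell
#!/usr/bin/env bash
# html_escape: hypothetical wrapper name; reads stdin, writes escaped text to stdout.
html_escape() {
    # '&' is handled first so entities produced by the later rules are not re-escaped.
    sed 's/&/\&amp;/g; s/</\&lt;/g; s/>/\&gt;/g; s/"/\&quot;/g; s/'"'"'/\&#39;/g'
}

printf '%s\n' '<a href="x">Tom & Jerry</a>' | html_escape
# prints: &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&lt;/a&gt;
```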

answered Oct 13 '12 at 13:49

ruakh


+1 for elegance and efficiency. You should post your answer here: http://stackoverflow.com/questions/5929492/bash-script-to-convert-from-html-entities-to-characters where they recommend installing `recode`, `perl`, `php`, `xmlstarlet` and `w3m` (a web browser for crying out loud). The last answer recommends using Python3 which although installed by default (in Ubuntu at least) is overkill too. – WinEunuuchs2Unix Mar 26 '17 at 23:43
@WinEunuuchs2Unix: Thanks for your kind words! That question is asking about the opposite direction (`&lt;` to `<`), and the answers there are trying to cover the possibility of random other entity references like `&eacute;` and numeric character references like `&#201;`, rather than minimally-escaped HTML. For many purposes that might be overengineering, but on Stack Overflow it can be hard to tell exactly what someone's purpose is, so I don't blame the answerers there for wanting to provide something universal. – ruakh Mar 26 '17 at 23:58
@ruakh You're welcome :) Can't your sed search and replace simply be reversed to accomplish the same result as those answers? – WinEunuuchs2Unix Mar 27 '17 at 00:43
@WinEunuuchs2Unix: There are many ways to HTML-escape a given piece of text; for example, `&lt;`, `&#60;`, and `&#x3C;` are all valid ways to escape `<`. My `sed` script only does one kind of HTML-escaping, since you only need one; but if you want to do HTML-*unescaping*, then either you need to handle all valid ways of escaping, or you need to know beforehand exactly what way of escaping was used. Do you see what I mean? – ruakh Mar 27 '17 at 01:21
Yes. My HTML-unescaping is limited to stack exchange site Ask Ubuntu and so far I've only noticed `&amp;`, `&lt;` and `&quot;`. The goal is to compare all the scripts on my drive I've published in Ask Ubuntu to see if they have been changed locally or revised by someone else in Ask Ubuntu. For fun I'm also extracting upvotes from the HTML file and putting it in the local file. This is the work in progress from a few nights ago: http://askubuntu.com/questions/894888/bash-template-to-use-zenity-or-yad-to-insert-edit-delete-records-in-a-file/896783#896783 – WinEunuuchs2Unix Mar 27 '17 at 01:31
Really useful. Same as a function: `escape_html() { sed $1 's/&/\&amp;/g; s/</\&lt;/g; s/>/\&gt;/g; s/"/\&quot;/g; s/'"'"'/\&#39;/g'; }` – geotheory Nov 17 '18 at 10:06

score 14 · Answer 2 · answered Jan 27 '16 at 14:36

You can use the recode utility:

    echo 'He said: "Not sure that - 2<1"' | recode ascii..html

Output:

    He said: &quot;Not sure that - 2&lt;1&quot;

answered Jan 27 '16 at 14:36

Ivan


Probably not available if there's no Python/Ruby/Perl. – tbodt Nov 19 '16 at 22:48
Tested on 30 or so text files containing ASCII and it even handles the null character `\0`. Used to sandbox text file contents for the `srcdoc` attribute of a sandboxed `iframe` in HTML and allow background styling via parent frame to cascade. – vhs May 06 '20 at 15:20

score 14 · Answer 3 · edited Dec 19 '19 at 18:42

Pure bash, no external programs:

function htmlEscape () {
    local s
    s=${1//&/&amp;}
    s=${s//</&lt;}
    s=${s//>/&gt;}
    s=${s//'"'/&quot;}
    printf -- %s "$s"
}

Just simple string substitution.

edited Dec 19 '19 at 18:42

Vladimir Panteleev


answered Sep 29 '18 at 16:45

miken32


score 2 · Answer 4 · answered Jun 07 '19 at 09:04

Or use xmlstarlet, which can escape and unescape special XML characters:

$ echo '<abc&def>'| xml esc
&lt;abc&amp;def&gt;

answered Jun 07 '19 at 09:04

schemacs


I want to try this but I don't know how to install xml esc. I don't even know what it is. Could you elaborate? – Ohiovr Dec 30 '19 at 11:25
Just `brew install xmlstarlet` if you are using MacOS. – schemacs Dec 30 '19 at 11:37

score 2 · Answer 5 · answered Feb 20 '22 at 05:42

I'm using jq:

$ echo "2 < 4 is 'TRUE'" | jq -Rr @html
2 &lt; 4 is &#39;TRUE&#39;

answered Feb 20 '22 at 05:42

yegor256


score 1 · Answer 6 · answered Nov 03 '22 at 19:36

This is an update to miken32's "Pure bash, no external programs" answer:

bash 5.2 breaks backward compatibility in ways that are highly inconvenient.

From NEWS:

x. New shell option: patsub_replacement. When enabled, a '&' in the replacement string of the pattern substitution expansion is replaced by the portion of the string that matched the pattern. Backslash will escape the '&' and insert a literal '&'.

The option is enabled by default. If you want to restore the previous behavior, add shopt -u patsub_replacement.

So there are three ways to use miken32's code in bash 5.2+:

Either disable patsub_replacement:

shopt -u patsub_replacement
function htmlEscape () {
    local s
    s=${1//&/&amp;}
    s=${s//</&lt;}
    s=${s//>/&gt;}
    s=${s//'"'/&quot;}
    printf -- %s "$s"
}

Another option is to escape '&' with a backslash in the replacement, which works regardless of the patsub_replacement setting:

function htmlEscape () {
    local s
    s=${1//&/\&amp;}
    s=${s//</\&lt;}
    s=${s//>/\&gt;}
    s=${s//'"'/\&quot;}
    printf -- %s "$s"
}

A third option is to quote the replacement string:

function htmlEscape () {
    local s
    s=${1//&/"&amp;"}
    s=${s//</"&lt;"}
    s=${s//>/"&gt;"}
    s=${s//'"'/"&quot;"}
    printf -- %s "$s"
}
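
Building on the quoted-replacement form, here is a hedged sketch that also escapes single quotes, which the versions above leave alone. The name `html_escape_attr` is hypothetical; the quoting of each replacement is what keeps `&` literal when patsub_replacement is enabled.

```shell
#!/usr/bin/env bash
# html_escape_attr: hypothetical name; pure-bash variant that also escapes single quotes.
html_escape_attr() {
    local s=$1
    s=${s//&/"&amp;"}      # ampersand first; quoted so '&' stays literal in bash 5.2+
    s=${s//</"&lt;"}
    s=${s//>/"&gt;"}
    s=${s//\"/"&quot;"}
    s=${s//\'/"&#39;"}
    printf '%s' "$s"
}

html_escape_attr "<b class='x'>R&D</b>"; echo
# prints: &lt;b class=&#39;x&#39;&gt;R&amp;D&lt;/b&gt;
```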

score -1 · Answer 7 · answered Oct 23 '22 at 03:15

There are much better answers, but I just found this so I thought I'd share.

PN=`basename "$0"`          # Program name
VER=`echo '$Revision: 1.1 $' | cut -d' ' -f2`

Usage () {
    echo >&2 "$PN - encode HTML unsafe characters, $VER
usage: $PN [file ...]"
    exit 1
}

set -- `getopt h "$@"`
while [ $# -gt 0 ]
do
    case "$1" in
    --) shift; break;;
    -h) Usage;;
    -*) Usage;;
    *)  break;;         # First file name
    esac
    shift
done

sed                                     \
    -e 's/&/\&amp;/g'                       \
    -e 's/"/\&quot;/g'                      \
    -e 's/</\&lt;/g'                        \
    -e 's/>/\&gt;/g'                        \
    -e 's/„/\&auml;/g'                      \
    -e 's/Ž/\&Auml;/g'                      \
    -e 's/”/\&ouml;/g'                      \
    -e 's/™/\&Ouml;/g'                      \
    -e 's//\&uuml;/g'                       \
    -e 's/š/\&Uuml;/g'                      \
    -e 's/á/\&szlig;/g'                     \
    "$@"

Most of these replacements are very wrong, and entities for Unicode characters are generally not needed any longer. Even so, there are hundreds of entities defined in HTML; why have you chosen these half dozen to incorrectly replace?? — miken32, Nov 03 '22 at 17:25
@miken32 `"but I just found this so I thought I'd share"` I found it while looking for resolutions, I didn't write it. It worked for me so I shared it. — WaXxX333, Nov 04 '22 at 01:38

score -5 · Answer 8 · edited May 02 '19 at 20:35

The previous sed replacement defaces valid output like

&lt;

into

&amp;lt;

Adding a negative look-ahead, so that "&" is only changed into "&amp;" if it isn't already followed by "amp;", fixes that:

sed 's/&(?!amp;)/\&amp;/g; s/</\&lt;/g; s/>/\&gt;/g; s/"/\&quot;/g; s/'"'"'/\&#39;/g'

edited May 02 '19 at 20:35

Jean-François Fabre


answered Mar 07 '15 at 12:51

nachtgeist


Big mistake. When I HTML-encode a string `&amp;`, it is because I want it to be rendered by some web browser as `&amp;`. That is why it **must** be turned into `&amp;amp;`. That way, HTML-encoding and HTML-decoding are in balance. You don't suppress HTML-encoding just because the input _looks like_ it has already been HTML-encoded. HTML-encoding is **not** idempotent. Failure to grasp that eventually leads to XSS vulnerabilities. – Ruud Helderman Nov 10 '15 at 20:47
@Ruud is right; the right way to accomplish this is to escape ampersands first, like in ruakh's answer. – Brian McCutchon Jan 14 '16 at 21:07
I totally agree with what @Ruud said except that he should have emphasized **failure to grasp that leads to XSS vulnerabilities** – kmkaplan Feb 01 '17 at 08:32
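
The point made in these comments can be checked with a short sketch. The helper names `escape` and `unescape` are hypothetical and cover only three entities; the idea is that text which already looks escaped must still be escaped again, so that decoding restores exactly what was put in.

```shell
#!/usr/bin/env bash
# escape/unescape: hypothetical helpers handling only &, < and >.
escape()   { sed 's/&/\&amp;/g; s/</\&lt;/g; s/>/\&gt;/g'; }
# Unescaping reverses the order: '&amp;' is decoded last.
unescape() { sed 's/&lt;/</g; s/&gt;/>/g; s/&amp;/\&/g'; }

input='Write &lt; to show <'
escaped=$(printf '%s\n' "$input" | escape)
printf '%s\n' "$escaped"              # prints: Write &amp;lt; to show &lt;
printf '%s\n' "$escaped" | unescape   # prints: Write &lt; to show <
```

Had the existing `&lt;` been left alone, decoding would have turned it into `<` and the round trip would no longer match the input.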
