51

The box has no Ruby/Python/Perl etc.

Only bash, sed, and awk.

One way is to replace characters using a map, but that becomes tedious.

Is there perhaps some built-in functionality I'm not aware of?

miken32
James Evans

8 Answers

74

Escaping HTML really just involves replacing three characters: <, >, and &. For extra points, you can also replace " and '. So, it's not a long sed script:

sed 's/&/\&amp;/g; s/</\&lt;/g; s/>/\&gt;/g; s/"/\&quot;/g; s/'"'"'/\&#39;/g'
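
As a minimal sketch, the one-liner above can be wrapped in a function so it works on stdin or on named files. The name `escape_html` is only an illustrative choice, not something from the answer. The `&` substitution has to stay first; otherwise the ampersands introduced by the later substitutions would themselves be escaped.

```bash
#!/usr/bin/env bash
# escape_html is a hypothetical wrapper name; the sed script is the one shown above.
# Any arguments are passed through to sed as input file names.
escape_html() {
    sed 's/&/\&amp;/g; s/</\&lt;/g; s/>/\&gt;/g; s/"/\&quot;/g; s/'"'"'/\&#39;/g' "$@"
}

printf '%s\n' 'if (a < b && c > "d") { x = '"'"'y'"'"'; }' | escape_html
# prints: if (a &lt; b &amp;&amp; c &gt; &quot;d&quot;) { x = &#39;y&#39;; }
```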
ruakh
  • 3
    +1 for elegance and efficiency. You should post your answer here: http://stackoverflow.com/questions/5929492/bash-script-to-convert-from-html-entities-to-characters where they recommend installing `recode`, `perl`, `php`, `xmlstarlet` and `w3m` (a web browser for crying out loud). The last answer recommends using Python3 which although installed by default (in Ubuntu at least) is overkill too. – WinEunuuchs2Unix Mar 26 '17 at 23:43
  • 1
    @WinEunuuchs2Unix: Thanks for your kind words! That question is asking about the opposite direction (`&lt;` to `<`), and the answers there are trying to cover the possibility of random other entity references like `&eacute;` and numeric character references like `&#xC9;`, rather than minimally-escaped HTML. For many purposes that might be overengineering, but on Stack Overflow it can be hard to tell exactly what someone's purpose is, so I don't blame the answerers there for wanting to provide something universal. – ruakh Mar 26 '17 at 23:58
  • @ruakh You're welcome :) Can't your sed search and replace simply be reversed to accomplish the same result as those answers? – WinEunuuchs2Unix Mar 27 '17 at 00:43
  • 1
    @WinEunuuchs2Unix: There are many ways to HTML-escape a given piece of text; for example, `&lt;`, `&#60;`, and `&#x3C;` are all valid ways to escape `<`. My `sed` script only does one kind of HTML-escaping, since you only need one; but if you want to do HTML-*unescaping*, then either you need to handle all valid ways of escaping, or you need to know beforehand exactly what way of escaping was used. Do you see what I mean? – ruakh Mar 27 '17 at 01:21
  • Yes. My HTML-unescaping is limited to the Stack Exchange site Ask Ubuntu, and so far I've only noticed `&amp;`, `&lt;` and `&quot;`. The goal is to compare all the scripts on my drive that I've published on Ask Ubuntu to see if they have been changed locally or revised by someone else on Ask Ubuntu. For fun I'm also extracting upvotes from the HTML file and putting them in the local file. This is the work in progress from a few nights ago: http://askubuntu.com/questions/894888/bash-template-to-use-zenity-or-yad-to-insert-edit-delete-records-in-a-file/896783#896783 – WinEunuuchs2Unix Mar 27 '17 at 01:31
  • Really useful. Same as a function: `escape_html() { sed $1 's/&/\&amp;/g; s/</\&lt;/g; s/>/\&gt;/g; s/"/\&quot;/g; s/'"'"'/\&#39;/g'; }` – geotheory Nov 17 '18 at 10:06
14

You can use the recode utility:

    echo 'He said: "Not sure that - 2<1"' | recode ascii..html

Output:

    He said: &quot;Not sure that - 2&lt;1&quot;
Ivan
  • 3
    Probably not available if there's no Python/Ruby/Perl. – tbodt Nov 19 '16 at 22:48
  • Tested on 30 or so text files containing ASCII, and it even handles the null character `\0`. I use it to sandbox text file contents for the `srcdoc` attribute of a sandboxed `iframe` in HTML, which allows background styling from the parent frame to cascade. – vhs May 06 '20 at 15:20
14

Pure bash, no external programs:

function htmlEscape () {
    local s
    s=${1//&/&amp;}
    s=${s//</&lt;}
    s=${s//>/&gt;}
    s=${s//'"'/&quot;}
    printf -- %s "$s"
}

Just simple string substitution.
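
For input that arrives on stdin rather than as an argument, the same substitutions can be applied line by line. This is only a sketch: the name `html_escape_stream` is hypothetical, it also escapes the single quote (which the function above does not), and the replacements are quoted so that `&` stays literal even on bash 5.2 and later, where the `patsub_replacement` option is on by default.

```bash
#!/usr/bin/env bash
# html_escape_stream is a hypothetical helper: it escapes each line of stdin
# using only bash builtins.
html_escape_stream() {
    local line
    # "|| [[ -n $line ]]" keeps a final line that has no trailing newline.
    while IFS= read -r line || [[ -n $line ]]; do
        line=${line//&/"&amp;"}     # ampersand first, so later entities are not re-escaped
        line=${line//</"&lt;"}
        line=${line//>/"&gt;"}
        line=${line//\"/"&quot;"}
        line=${line//\'/"&#39;"}
        printf '%s\n' "$line"
    done
}

printf '%s\n' '<a href="x">Tom & Jerry</a>' | html_escape_stream
# prints: &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&lt;/a&gt;
```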

Vladimir Panteleev
miken32
2

Or use xmlstarlet, which can escape and unescape special XML characters (the binary is named `xml` on some systems and `xmlstarlet` on others):

$ echo '<abc&def>'| xml esc
&lt;abc&amp;def&gt;
schemacs
2

I'm using jq:

$ echo "2 < 4 is 'TRUE'" | jq -Rr @html
2 &lt; 4 is &#39;TRUE&#39;
yegor256
1

This is an update to miken32's "Pure bash, no external programs" answer:

bash 5.2 breaks backward compatibility in ways that are highly inconvenient.

From NEWS:

x. New shell option: patsub_replacement. When enabled, a '&' in the replacement string of the pattern substitution expansion is replaced by the portion of the string that matched the pattern. Backslash will escape the '&' and insert a literal '&'.

The option is enabled by default. If you want to restore the previous behavior, add shopt -u patsub_replacement.

So there are three ways to use miken32's code in bash 5.2+:

Either disable patsub_replacement:

shopt -u patsub_replacement
function htmlEscape () {
    local s
    s=${1//&/&amp;}
    s=${s//</&lt;}
    s=${s//>/&gt;}
    s=${s//'"'/&quot;}
    printf -- %s "$s"
}

Another option is to escape each '&' in the replacement with a backslash, which works whether or not patsub_replacement is available or enabled:

function htmlEscape () {
    local s
    s=${1//&/\&amp;}
    s=${s//</\&lt;}
    s=${s//>/\&gt;}
    s=${s//'"'/\&quot;}
    printf -- %s "$s"
}

A third option is to quote the replacement string:

function htmlEscape () {
    local s
    s=${1//&/"&amp;"}
    s=${s//</"&lt;"}
    s=${s//>/"&gt;"}
    s=${s//'"'/"&quot;"}
    printf -- %s "$s"
}
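
To see which behavior a given shell has, a small check along these lines may help. It is a sketch, not part of the NEWS entry: the variable names are arbitrary, and it relies on `shopt -q` failing on bash versions that do not know the option.

```bash
#!/usr/bin/env bash
s='a<b'

# shopt -q returns non-zero if the option is off or unknown (bash before 5.2);
# stderr is discarded to hide the "invalid shell option name" message.
if shopt -q patsub_replacement 2>/dev/null; then
    state='enabled'
else
    state='disabled or unavailable'
fi

unquoted=${s//</&lt;}      # with the option enabled, & expands to the matched text "<"
quoted=${s//</"&lt;"}      # quoting keeps & literal on every version

printf 'patsub_replacement: %s\n' "$state"
printf 'unquoted: %s\n' "$unquoted"
printf 'quoted:   %s\n' "$quoted"
```

With the option enabled, the unquoted form should yield `a<lt;b` instead of `a&lt;b`, while the quoted form gives `a&lt;b` either way.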
thierrybo
-1

There are much better answers, but I just found this so I thought I'd share.

PN=`basename "$0"`          # Program name
VER=`echo '$Revision: 1.1 $' | cut -d' ' -f2`

Usage () {
    echo >&2 "$PN - encode HTML unsafe characters, $VER
usage: $PN [file ...]"
    exit 1
}

set -- `getopt h "$@"`
while [ $# -gt 0 ]
do
    case "$1" in
    --) shift; break;;
    -h) Usage;;
    -*) Usage;;
    *)  break;;         # First file name
    esac
    shift
done

sed                                     \
    -e 's/&/\&amp;/g'                       \
    -e 's/"/\&quot;/g'                      \
    -e 's/</\&lt;/g'                        \
    -e 's/>/\&gt;/g'                        \
    -e 's/„/\&auml;/g'                      \
    -e 's/Ž/\&Auml;/g'                      \
    -e 's/”/\&ouml;/g'                      \
    -e 's/™/\&Ouml;/g'                      \
    -e 's//\&uuml;/g'                       \
    -e 's/š/\&Uuml;/g'                      \
    -e 's/á/\&szlig;/g'                     \
    "$@"
WaXxX333
  • Most of these replacements are very wrong, and entities for Unicode characters are generally not needed any longer. Even so, there are hundreds of entities defined in HTML; why have you chosen these half dozen to incorrectly replace?? – miken32 Nov 03 '22 at 17:25
  • @miken32 `"but I just found this so I thought I'd share"` I found it while looking for resolutions, I didn't write it. It worked for me so I shared it. – WaXxX333 Nov 04 '22 at 01:38
-5

The previous sed replacement defaces valid output like

&lt;

into

&amp;lt;

Adding a negative look-ahead so "&" is only changed into "&amp;" if that "&" isn't already followed by "amp;" fixes that:

sed 's/&(?!amp;)/\&amp;/g; s/</\&lt;/g; s/>/\&gt;/g; s/"/\&quot;/g; s/'"'"'/\&#39;/g'
Jean-François Fabre
nachtgeist
  • 8
    Big mistake. When I HTML-encode a string `&amp;`, it is because I want it to be rendered by some web browser as `&amp;`. That is why it **must** be turned into `&amp;amp;`. That way, HTML-encoding and HTML-decoding are in balance. You don't suppress HTML-encoding just because the input _looks like_ it has already been HTML-encoded. HTML-encoding is **not** idempotent. Failure to grasp that eventually leads to XSS vulnerabilities. – Ruud Helderman Nov 10 '15 at 20:47
  • 1
    @Ruud is right; the right way to accomplish this is to escape ampersands first, like in ruakh's answer. – Brian McCutchon Jan 14 '16 at 21:07
  • 3
    I totally agree with what @Ruud said except that he should have emphasized **failure to grasp that leads to XSS vulnerabilities** – kmkaplan Feb 01 '17 at 08:32
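
The point made in these comments can be checked with a small round trip. This is a sketch under assumptions: `escape_html` reuses the sed script from ruakh's answer, and `unescape_html` is a hypothetical inverse that only understands the five forms that script emits. Input that already looks escaped is escaped again, and decoding it once returns exactly the original text.

```bash
#!/usr/bin/env bash
# escape_html: the sed script from ruakh's answer, ampersand first.
escape_html() {
    sed 's/&/\&amp;/g; s/</\&lt;/g; s/>/\&gt;/g; s/"/\&quot;/g; s/'"'"'/\&#39;/g'
}

# unescape_html: hypothetical inverse for exactly those five forms.
# The ampersand is decoded last, mirroring "ampersand first" when encoding.
unescape_html() {
    sed 's/&lt;/</g; s/&gt;/>/g; s/&quot;/"/g; s/&#39;/'"'"'/g; s/&amp;/\&/g'
}

text='Write &lt; to show a less-than sign'
once=$(printf '%s\n' "$text" | escape_html)
back=$(printf '%s\n' "$once" | unescape_html)

printf '%s\n' "$once"   # Write &amp;lt; to show a less-than sign
printf '%s\n' "$back"   # Write &lt; to show a less-than sign
```

If the `&` in `&lt;` were skipped because it "looks escaped", a browser would render `<` instead of the literal text `&lt;` that the author wrote.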