Bash script to convert from HTML entities to characters

Question

I'm looking for a way to turn this:

hello &lt; world

to this:

hello < world

I could use sed, but how can this be accomplished without using cryptic regex?

score 109 · Answer 1 · edited May 23 '16 at 20:25

Try recode (archived page; GitHub mirror; Debian page):

$ echo '&lt;' |recode html..ascii
<

Install on Debian, Ubuntu, and other apt-based Linux systems:

$ sudo apt-get install recode

Install on macOS using Homebrew:

$ brew install recode

edited May 23 '16 at 20:25 by Dave Jarvis
answered May 08 '11 at 18:49 by ceving

1

why do I get "recode: Untranslatable input in step `ANSI_X3.4-1968..ISO-10646-UCS-2'" when I try this the opposite way? – Sebastian Heyn Aug 24 '16 at 14:35
4

Just my 2 cents -- I convert XML encoded in UTF-8 and I use: recode xml..utf8 – bubak Nov 14 '16 at 15:35
2

Good for HTML entities but messes up emoji: `echo '' | recode html..UTF-8` gives `ð`. The Perl method retains them. – Hugo Nov 24 '16 at 06:18
1

@Hugo If you encode your emoji in proper HTML, it will not be messed. – ceving Jan 12 '17 at 09:15
BTW to run this in place in a file, do `recode -i html..UTF-8 $file` – dessalines Oct 06 '18 at 15:18
1

This is good, however it mangles unicode (`& ٱلْعَرَبِيَّة` becomes `& Ù±ÙÙØ¹ÙØ±ÙØ¨ÙÙÙÙØ©`). The `perl` approach is preferred in situations like these. – leetbacoon Nov 30 '19 at 18:47
4

`recode` assumes that HTML is encoded in ISO-8859 and will therefore break UTF-8: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=748984 – Martin von Wittich Jun 25 '21 at 13:38
1

`html..utf8` completely messes up basic already-encoded characters like `’`, and `html..ascii` can be a mess too. user1788934’s answer with Perl is much better and doesn’t need anything to be installed. – Yatharth Agarwal Jan 06 '22 at 11:00
@YatharthAgarwal `recode` supports just HTML 4. Use `recode -l` to get a list of supported encodings. My answer is from 2011. HTML5 is from 2014. – ceving Jan 06 '22 at 11:45
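
The mangled output quoted in the comments above is consistent with UTF-8 bytes being read as ISO-8859-1 (Latin-1). Here is a minimal Python sketch of that effect, assuming this is indeed the mechanism; the helper name `misread_as_latin1` is made up for illustration and is not part of recode:

```python
def misread_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then decode those bytes as if they were Latin-1."""
    return text.encode("utf-8").decode("latin-1")


# Each non-ASCII character turns into two or more Latin-1 characters.
print(misread_as_latin1("\u00e9"))         # é  -> Ã©
print(misread_as_latin1("hello < world"))  # pure ASCII is unaffected
```

Pure ASCII input survives unchanged, which would explain why the problem only shows up once the file contains non-ASCII text.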

score 70 · Answer 2 · edited Jul 21 '23 at 09:29

With perl:

cat foo.html | perl -MHTML::Entities -pe 'decode_entities($_);'

With php from the command line:

cat foo.html | php -r 'while(($line=fgets(STDIN)) !== FALSE) echo html_entity_decode($line, ENT_QUOTES|ENT_HTML401);'

edited Jul 21 '23 at 09:29 by Benjamin Loison
answered Oct 31 '12 at 15:45 by user1788934

2

The PHP one is not working for certain characters such as ` ` – Romain Paulus Dec 20 '13 at 05:13
11

Shorter Perl version: `perl -MHTML::Entities -pe 'decode_entities($_);'` – RobEarl Aug 07 '14 at 08:48
6

I'll give you an upvote if you remove the useless use of cat (https://en.wikipedia.org/wiki/Cat_(Unix)#Useless_use_of_cat) :-) – 0x89 Aug 19 '14 at 09:10
11

Use `perl -C -MHTML::Entities -pe 'decode_entities($_);' < foo.html` to output UTF-8 (see [this question](http://stackoverflow.com/questions/627661/how-can-i-output-utf-8-from-perl)) – tricasse Oct 02 '15 at 09:15
3

'cat is useless' comments are ill-considered. The reprimanded user may for example have been doing something like 'zcat FILE.gz | just two command-lines before the current line, and 'gunzip FILE.gz' one command-line before the current. With history and readline, that user can now hit UPARROW twice, then hit HOME, delete one character (the 'z'), and hit ENTER to run the command that "cat useless" hecklers abhor. Ergo: 'cat is useless' comments are often less keen than they are clueless. – Oct 11 '19 at 13:53
The Perl solution works even in Git Bash on Windows where `recode` is not available. – Palec Jul 15 '22 at 09:56

score 23 · Answer 3 · edited Apr 09 '21 at 17:55

An alternative is to pipe through a web browser -- such as:

echo '&#33;' | w3m -dump -T text/html

This worked great for me in Cygwin, where downloading and installing distributions is difficult.

This answer was found here

edited Apr 09 '21 at 17:55 by tukusejssirs
answered May 26 '11 at 15:50 by Whitecat

score 20 · Answer 4 · answered May 09 '11 at 10:47

Using xmlstarlet:

echo 'hello &lt; world' | xmlstarlet unesc

answered May 09 '11 at 10:47 by user243

7

Note that this does not work for hexadecimal entities like `&#x3a;`. – v6ak Aug 13 '13 at 21:00
3

It also fails for " – user100464 Mar 06 '18 at 20:43

score 19 · Answer 5 · edited Jul 21 '23 at 09:30

A Python 3.4+ version (`html.unescape` was added in Python 3.4):

cat foo.html | python3 -c 'import html, sys; [print(html.unescape(l), end="") for l in sys.stdin]'

edited Jul 21 '23 at 09:30 by Benjamin Loison
answered Mar 08 '17 at 13:41 by Aissen

How to make this have effect in file? I mean, to replace in file? – Sigur Nov 12 '19 at 18:15
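Save the output to a new file by adding `[...] > foo.decoded.html` to the end of the command, then rename it over the original. Redirecting straight to `foo.html` would truncate the file before it is read. – Danton Heuer Sep 28 '22 at 12:17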
1

Note that this tries to load lines into memory, and for each processed line, it adds an empty entry to an array. If you don't have long lines, replacing the `[... for x in y]` syntax with a regular `for x in y:\n\t...` loop will work. – Luc Nov 25 '22 at 11:23
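
The plain-loop variant suggested in the last comment could look like the sketch below. The function name `unescape_lines` is hypothetical; only `html.unescape` comes from the standard library. It decodes one line at a time and builds no list of results.

```python
import html
from typing import Iterable, Iterator


def unescape_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each line with HTML entities decoded; nothing is accumulated."""
    for line in lines:
        yield html.unescape(line)


# Demo on an in-memory sample; in a pipeline, pass sys.stdin instead.
sample = ["hello &lt; world\n", "a &amp; b\n"]
for decoded in unescape_lines(sample):
    print(decoded, end="")
```

Each line is still held in memory while it is processed, so a file consisting of one enormous line would need a different approach.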

score 16 · Answer 6 · edited Jul 21 '23 at 09:30

This answer is based on: Short way to escape HTML in Bash? which works fine for grabbing answers (using wget) on Stack Exchange and converting HTML to regular ASCII characters:

sed 's/&nbsp;/ /g; s/&lt;/</g; s/&gt;/>/g; s/&quot;/"/g; s/&#39;/'"'"'/g; s/&ldquo;/"/g; s/&rdquo;/"/g; s/&amp;/\&/g'

Edit 1: April 7, 2017 - Added left double quote and right double quote conversion. This is part of bash script that web-scrapes SE answers and compares them to local code files here: Ask Ubuntu - Code Version Control between local files and Ask Ubuntu answers

Edit June 26, 2017

Using sed was taking ~3 seconds to convert HTML to ASCII on a 1K line file from Ask Ubuntu / Stack Exchange. As such I was forced to use Bash built-in search and replace for ~1 second response time.

Here's the function:

LineOut=""      # Make global
HTMLtoText () {
    LineOut=$1  # Parm 1= Input line
    # Replace external command: Line=$(sed 's/&amp;/\&/g; s/&lt;/\</g; 
    # s/&gt;/\>/g; s/&quot;/\"/g; s/&#39;/\'"'"'/g; s/&ldquo;/\"/g; 
    # s/&rdquo;/\"/g;' <<< "$Line") -- With faster builtin commands.
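    # Unquoted assignments with backslash-escaped replacements are intended
    # to avoid quoting differences between bash versions. "&amp;" is decoded
    # last so that input such as "&amp;lt;" becomes "&lt;" rather than "<".
    LineOut=${LineOut//&nbsp;/ }
    LineOut=${LineOut//&lt;/\<}
    LineOut=${LineOut//&gt;/\>}
    LineOut=${LineOut//&quot;/\"}
    LineOut=${LineOut//&#39;/\'}
    LineOut=${LineOut//&ldquo;/\"} # TODO: ASCII/ISO for opening quote
    LineOut=${LineOut//&rdquo;/\"} # TODO: ASCII/ISO for closing quote
    LineOut=${LineOut//&amp;/\&}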
} # HTMLtoText ()

On macOS Mojave with the default bash / sed, the `HTMLtoText` function wasn't doing the right thing with quotes (it would emit e.g. `'"'`) and I couldn't get it to work. The original sed function worked correctly, though. – Nathan Arthur, Mar 09 '19 at 21:32
It may work for specific use cases, but it is not a general solution. For example, a numeric reference such as `&#60;` is not unescaped into `<`. – unagi, Sep 03 '19 at 05:36
The replacement of `&amp;` must be the last replacement. Test with input of `'&amp;lt;'`. The result should be `&lt;`, not `<`. Both the sed and the bash solution, as originally posted, have this problem. – Robin A. Meade, Nov 22 '22 at 21:17
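
The ordering issue raised in the last comment can be reproduced with two chained replacements. This is only an illustrative Python sketch (the function names are made up), not a port of the bash function:

```python
def decode_amp_first(text: str) -> str:
    """Buggy order: '&amp;' is decoded before '&lt;'."""
    return text.replace("&amp;", "&").replace("&lt;", "<")


def decode_amp_last(text: str) -> str:
    """Safer order: '&amp;' is decoded after every other entity."""
    return text.replace("&lt;", "<").replace("&amp;", "&")


sample = "&amp;lt;"
print(decode_amp_first(sample))  # '<'     (decoded twice)
print(decode_amp_last(sample))   # '&lt;'  (decoded once)
```

Decoding `&amp;` first produces a fresh `&lt;` that the next replacement then decodes again, which is why it has to come last in any chain of plain substitutions.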

score 6 · Answer 7 · answered Oct 30 '21 at 05:20

On macOS, you can use the built-in command textutil (which is a handy utility in general):

echo '&#128075; hello &lt; world &#x1f310;' | textutil -convert txt -format html -stdin -stdout

outputs:

👋 hello < world 🌐

answered Oct 30 '21 at 05:20 by Minh Nguyễn

2

Just a note that you may need to provide a hint to `textutil` about input encoding for it to work correctly. e.g. `echo "здравствуйте < мир" | textutil -convert txt -format html -stdin -stdout -inputencoding 4` – FluffulousChimp Jun 03 '22 at 02:52

unagi · Answer 8 · 2019-12-06T02:21:30.763

To support the unescaping of all HTML entities only with sed substitutions would require too long a list of commands to be practical, because every Unicode code point has at least two corresponding HTML entities.

But it can be done using only sed, grep, the Bourne shell and basic UNIX utilities (the GNU coreutils or equivalent):

#!/bin/sh

htmlEscDec2Hex() {
    file=$1
    [ ! -r "$file" ] && file=$(mktemp) && cat >"$file"

    printf -- \
        "$(sed 's/\\/\\\\/g;s/%/%%/g;s/&#[0-9]\{1,10\};/\&#x%x;/g' "$file")\n" \
        $(grep -o '&#[0-9]\{1,10\};' "$file" | tr -d '&#;')

    [ x"$1" != x"$file" ] && rm -f -- "$file"
}

htmlHexUnescape() {
    printf -- "$(
        sed 's/\\/\\\\/g;s/%/%%/g
            ;s/&#x\([0-9a-fA-F]\{1,8\}\);/\&#x0000000\1;/g
            ;s/&#x0*\([0-9a-fA-F]\{4\}\);/\\u\1/g
            ;s/&#x0*\([0-9a-fA-F]\{8\}\);/\\U\1/g' )\n"
}

htmlEscDec2Hex "$1" | htmlHexUnescape \
    | sed -f named_entities.sed

Note, however, that a printf implementation supporting \uHHHH and \UHHHHHHHH sequences is required, such as the GNU utility’s. To test, check for example that printf "\u00A7\n" prints §. To call the utility instead of the shell built-in, replace the occurrences of printf with env printf.

This script uses an additional file, named_entities.sed, in order to support the named entities. It can be generated from the specification using the following HTML page:

<!DOCTYPE html>
<head><meta charset="utf-8" /></head>
<body>
<p id="sed-script"></p>
<script type="text/javascript">
  const referenceURL = 'https://html.spec.whatwg.org/entities.json';

  function writeln(element, text) {
    element.appendChild( document.createTextNode(text) );
    element.appendChild( document.createElement("br") );
  }

  (async function(container) {
    const json = await (await fetch(referenceURL)).json();
    container.innerHTML = "";
    writeln(container, "#!/usr/bin/sed -f");
    const addLast = [];
    for (const name in json) {
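      const characters = json[name].characters
        .replace("\\", "\\\\")
        .replace("/", "\\/")
        .replace("&", "\\&")     // a bare & in a sed replacement means "the matched text"
        .replace("\n", "\\n");   // keep each sed command on a single line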
      const command = "s/" + name + "/" + characters + "/g";
      if ( name.endsWith(";") ) {
        writeln(container, command);
      } else {
        addLast.push(command);
      }
    }
    for (const command of addLast) { writeln(container, command); }
  })( document.getElementById("sed-script") );
</script>
</body></html>

Simply open it in a modern browser, and save the resulting page as text as named_entities.sed. This sed script can also be used alone if only named entities are required; in this case it is convenient to give it executable permission so that it can be called directly.

Now the above shell script can be used as ./html_unescape.sh foo.html, or inside a pipeline reading from standard input.

For example, if the data must be processed in chunks (which may be the case if printf is not a shell built-in and the data to process is large), it can be used as:

nLines=20
seq 1 $nLines $(grep -c $ "$inputFile") | while read n
    do sed -n "$n,$((n+nLines-1))p" "$inputFile" | ./html_unescape.sh
done

Explanation of the script follows.

There are three types of escape sequences that need to be supported:

&#D; where D is the decimal value of the escaped character’s Unicode code point;
&#xH; where H is the hexadecimal value of the escaped character’s Unicode code point;
&N; where N is the name of one of the named entities for the escaped character.

The &N; escapes are supported by the generated named_entities.sed script which simply performs the list of substitutions.

The central piece of this method for supporting the code point escapes is the printf utility, which is able to:

print numbers in hexadecimal format, and
print characters from their code point’s hexadecimal value (using the escapes \uHHHH or \UHHHHHHHH).

The first feature, with some help from sed and grep, is used to reduce the &#D; escapes into &#xH; escapes. The shell function htmlEscDec2Hex does that.
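
The decimal-to-hexadecimal step can be sketched in Python as follows. This illustrates the idea only and is not a translation of htmlEscDec2Hex; the names `DECIMAL_REF` and `dec_refs_to_hex` are hypothetical, and the 1 to 10 digit limit simply mirrors the sed pattern used in the script.

```python
import re

# Same shape as the script's sed pattern: "&#", 1 to 10 decimal digits, ";".
DECIMAL_REF = re.compile(r"&#([0-9]{1,10});")


def dec_refs_to_hex(text: str) -> str:
    """Rewrite &#D; references as &#xH; references, leaving other text alone."""
    return DECIMAL_REF.sub(lambda m: "&#x{:x};".format(int(m.group(1))), text)


print(dec_refs_to_hex("&#60; &#128075;"))  # &#x3c; &#x1f44b;
```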

The function htmlHexUnescape uses sed to transform the &#xH; escapes into printf’s \u/\U escapes, then uses the second feature to print the unescaped characters.
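
For comparison, the three escape types described above can be handled in a single pass. The following Python sketch is a simplified illustration under stated assumptions: it requires the terminating semicolon, takes named entities from the standard library's `html.entities.html5` table, and uses made-up names (`REFERENCE`, `decode_reference`, `decode_entities`). The standard `html.unescape` applies further HTML5 rules that are not reproduced here.

```python
import re
from html.entities import html5  # maps names such as "lt;" to characters

# Decimal (&#D;), hexadecimal (&#xH;) or named (&N;) reference.
REFERENCE = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def decode_reference(match):
    """Return the character for one matched reference, or the text unchanged."""
    body = match.group(1)
    try:
        if body[:2] in ("#x", "#X"):
            return chr(int(body[2:], 16))
        if body.startswith("#"):
            return chr(int(body[1:], 10))
    except (ValueError, OverflowError):
        return match.group(0)  # code point out of range: leave as-is
    return html5.get(body + ";", match.group(0))  # unknown name: leave as-is


def decode_entities(text: str) -> str:
    """Decode every reference once; decoded output is never rescanned."""
    return REFERENCE.sub(decode_reference, text)


print(decode_entities("hello &lt; &#x3a; &#58; &quot; world"))
```

Because the substitution never rescans its own output, an input such as `&amp;lt;` comes out as `&lt;` and is not decoded a second time.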

score 1 · Answer 9 · edited Jul 21 '23 at 09:31

I like the Perl answer given in https://stackoverflow.com/a/13161719/1506477.

cat foo.html | perl -MHTML::Entities -pe 'decode_entities($_);'

However, it produced a different number of lines on plain text files (and I don't know Perl well enough to debug it).

I like the python answer given in https://stackoverflow.com/a/42672936/1506477 --

python3 -c 'import html, sys; [print(html.unescape(l), end="") for l in sys.stdin]'

but it builds a list [ ... for l in sys.stdin] in memory, which is prohibitive for large files.

Here is another easy pythonic way without buffering in memory: using awkg.

$ echo 'hello &lt; &#x3a; &quot; world' | \
   awkg -b 'from html import unescape' 'print(unescape(R0))'
hello < : " world

awkg is a python based awk-like line processor. You may install it using pip https://pypi.org/project/awkg/:

pip install awkg

-b is the equivalent of awk's BEGIN{} block, which runs once at the beginning.
Here we just did from html import unescape.

Each line record is in R0 variable, for which we did print(unescape(R0))

Disclaimer:
I am the maintainer of awkg

score 0 · Answer 10 · answered Aug 25 '20 at 20:01

I have created a sed script based on the list of entities, so it should handle most of them.

sed -f htmlentities.sed < file.html

answered Aug 25 '20 at 20:01 by Ajnasz

2

The script handles neither the hexadecimal entities nor at least 2000 of the named entities. Of course, it does not handle all the decimal entities either (such as `&#12398;`, i.e. の). See my answer above for a complete solution. Lastly, the input redirection < is not needed when invoking sed. – unagi Aug 29 '20 at 12:13
@unagi Oh god, why didn't you just put it somewhere as a downloadable script with a short explanation? :) It would have saved me some minutes of work. I didn't even read through the answer, as there were two different things: the first, I guess, doesn't handle the named entities, and the second was HTML/JavaScript code, which I didn't want to involve as I just wanted a simple sed script. :) – Ajnasz Aug 29 '20 at 16:18

score 0 · Answer 11 · answered Jan 06 '22 at 18:29

My original answer drew comments that recode does not work for UTF-8 encoded HTML files. This is correct: recode supports only HTML 4. The encoding HTML is an alias for HTML_4.0:

$ recode -l | grep -iw html
HTML-i18n 2070 RFC2070
HTML_4.0 h h4 HTML

The default encoding for HTML 4 is Latin-1, whereas the default encoding for HTML 5 is UTF-8. This is why recode does not work for HTML 5 files.

HTML 5 defines the list of entities here:

https://html.spec.whatwg.org/multipage/named-characters.html

The definition includes a machine readable specification in JSON format:

https://html.spec.whatwg.org/entities.json

The JSON file can be used to perform a simple text replacement. The following example is a self modifying Perl script, which caches the JSON specification in its DATA chunk.

Note: For compatibility reasons, the specification allows some entities without a terminating semicolon. The entities are therefore sorted by length, longest first, so that the longer entities are replaced before a semicolon-less entity can match a prefix of them and destroy them.

#! /usr/bin/perl
use utf8;
use strict;
use warnings;
use open qw(:std :utf8);
use LWP::Simple;
use JSON::Parse qw(parse_json);

my $entities;

INIT {
  if (eof DATA) {
    my $data = tell DATA;
    open DATA, '+<', $0;
    seek DATA, $data, 0;
    my $entities_json = get 'https://html.spec.whatwg.org/entities.json';
    print DATA $entities_json;
    truncate DATA, tell DATA;
    close DATA;
    $entities = parse_json ($entities_json);
  } else {
    local $/ = undef;
    $entities = parse_json (<DATA>);
  }
}

local $/ = undef;
my $html = <>;

for my $entity (sort { length $b <=> length $a } keys %$entities) {
  my $characters = $entities->{$entity}->{characters};
  $html =~ s/$entity/$characters/g;
}

print $html;

__DATA__

Example usage:

$ echo '&nbsp;&amp;&nbsp;ٱلْعَرَبِيَّة' | ./html5-to-utf8.pl
 & ٱلْعَرَبِيَّة
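
The reason for the length ordering mentioned in the note above can be shown with two entities from the specification, `&not` (allowed without a semicolon) and `&notin;`. This Python sketch is an illustration only; it uses the standard library's `html.entities.html5` table rather than the JSON file, and the helper name `replace_all` is made up.

```python
from html.entities import html5  # keys like "not" and "notin;" (no leading "&")


def replace_all(text: str, names) -> str:
    """Apply plain text replacements for the given entity names, in order."""
    for name in names:
        text = text.replace("&" + name, html5[name])
    return text


sample = "x &notin; S"
shortest_first = sorted(["not", "notin;"], key=len)
longest_first = sorted(["not", "notin;"], key=len, reverse=True)

print(replace_all(sample, shortest_first))  # "&not" matches first and breaks "&notin;"
print(replace_all(sample, longest_first))   # "&notin;" is replaced as a whole
```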

score -1 · Answer 12 · answered Jan 12 '20 at 15:26

With Xidel:

echo 'hello &lt; &#x3a; &quot; world' | xidel -s - -e 'parse-html($raw)'
hello < : " world

answered Jan 12 '20 at 15:26 by Reino
