How do I unescape an HTML attribute value in Prolog?

Question

I found the predicate xml_quote_attribute/2 in library(sgml) of SWI-Prolog. It works with the first argument as input and the second argument as output:

?- xml_quote_attribute('<abc>', X).
X = '&lt;abc&gt;'.

But I couldn't figure out how to do the reverse conversion. For example, the following query doesn't work:

?- xml_quote_attribute(X, '&lt;abc&gt;').
ERROR: Arguments are not sufficiently instantiated

Is there another predicate that does the job?

Bye

When I needed this functionality, I wrote it myself. But that was almost 5 years ago... maybe something has been added in the meantime. – CapelliC, Aug 29 '14 at 13:02

Wouter Beek · Answer 1 · 2014-09-14T06:42:25.723

score 3

This is what Ruud's solution looks like in DCG notation with pushback lists (semicontext notation).

:- use_module(library(dcg/basics)).

html_unescape --> sgml_entity, !, html_unescape.
html_unescape, [C] --> [C], !, html_unescape.
html_unescape --> [].

sgml_entity, [C] --> "&#", integer(C), ";".
sgml_entity, "<" --> "&lt;".
sgml_entity, ">" --> "&gt;".
sgml_entity, "&" --> "&amp;".

Using DCGs makes the code a bit more readable. It also does away with some of the superfluous backtracking that Cookie Monster noted is the result of using append/3 for this.
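The grammar above is called through phrase/3: because the unescaped codes are pushed back onto the remaining input, they come out as the rest argument, as in phrase(html_unescape, Escaped, Unescaped). Tracing the rules suggests one side effect: pushed-back output is scanned again, so "&amp;lt;" appears to be unescaped twice. Below is a minimal sketch of a variant that collects the result in an explicit argument instead. The names unescaped//1, entity//1 and html_unescape_atom/2 are made up for this illustration; integer//1 comes from library(dcg/basics).

```prolog
:- use_module(library(dcg/basics)).   % provides integer//1

% unescaped(-Codes)// : Codes is the input with known entities replaced.
% Output is never fed back into the input, so nothing is unescaped twice.
unescaped([C|Cs]) --> entity(C), !, unescaped(Cs).
unescaped([C|Cs]) --> [C], !, unescaped(Cs).
unescaped([])     --> [].

% entity(-Code)// : one character reference and the code it stands for.
entity(C)   --> "&#", integer(C), ";".
entity(0'<) --> "&lt;".
entity(0'>) --> "&gt;".
entity(0'&) --> "&amp;".

% Hypothetical convenience wrapper working on atoms.
html_unescape_atom(Escaped, Unescaped) :-
    atom_codes(Escaped, Codes),
    phrase(unescaped(Out), Codes),
    atom_codes(Unescaped, Out).
```

With this sketch, html_unescape_atom('&lt;abc&gt;', X) is expected to give X = '<abc>', and '&amp;lt;' comes back as '&lt;' rather than '<'.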

edited Sep 14 '14 at 06:42

answered Sep 13 '14 at 10:04

Wouter Beek

Is there still a place mentioning "right-hand context"? This notion turned out to be quite misleading. Instead, use semicontext. – false Sep 13 '14 at 12:58
@CookieMonster: What you prefer is not steadfast. – false Sep 13 '14 at 18:26
@false I've changed the term "right-hand context" to "semicontext". It's indeed less confusing. – Wouter Beek Sep 14 '14 at 06:46
@CookieMonster: "Best would be if both holds," that is next-to-impossible: Adding a single extra non-terminal `epsilon --> [].` into the rule-body already destroys any such properties. Only steadfastness remains. You might (internally and transparently) check if the 2nd list argument is of a certain form, but this has to be a conservative check. – false Sep 14 '14 at 13:44
A kind of bidirectionality is possible in what I posted on G+. Making it steadfast again shouldn't be a problem. The only problem is that it is hand-coded in Prolog, using var/1 but more in a CLP fashion, and I haven't yet found a DCG compilation technique that can do the same. – Sep 15 '14 at 10:21

Ruud Helderman · Answer 2 · 2014-09-06T19:16:15.513

score 1

Here's the naive solution, using lists of character codes. Most likely it will not give you the best performance possible, but for strings that are not extremely long, it might just be alright.

html_unescape("", "") :- !.

html_unescape(Escaped, Unescaped) :-
    append("&", _, Escaped),
    !,
    append(E1, E2, Escaped),
    sgml_entity(E1, U1),
    !,
    html_unescape(E2, U2),
    append(U1, U2, Unescaped).

html_unescape(Escaped, Unescaped) :-
    append([C], E2, Escaped),
    html_unescape(E2, U2),
    append([C], U2, Unescaped).

sgml_entity(Escaped, [C]) :-
    append(["&#", L, ";"], Escaped),
    catch(number_codes(C, L), error(syntax_error(_), _), fail),
    !.

sgml_entity("&lt;", "<").
sgml_entity("&gt;", ">").
sgml_entity("&amp;", "&").

You will have to complete the list of SGML entities yourself.
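For instance, a few more entries could be added as extra sgml_entity/2 facts. This is only a sketch: the selection of entities is arbitrary, and the directive is an assumption about the environment. It is needed on SWI-Prolog 7 or later, where double-quoted text is a string object by default and has to be switched back to code lists for the code in this answer to work.

```prolog
% Assumption: SWI-Prolog 7+, where "..." is not a code list by default.
:- set_prolog_flag(double_quotes, codes).

% Additional named entities, written in the same style as the facts above.
sgml_entity("&quot;", "\"").    % quotation mark, code 34
sgml_entity("&apos;", "'").     % apostrophe, code 39 (XML and HTML5)
sgml_entity("&nbsp;", [160]).   % no-break space
sgml_entity("&copy;", [169]).   % copyright sign
```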

Sample output:

?- html_unescape("&lt;a&gt; &#26361;&#25805;", L), format('~s', [L]).
<a> 曹操
L = [60, 97, 62, 32, 26361, 25805].

edited Sep 06 '14 at 19:16

answered Aug 31 '14 at 21:32

Ruud Helderman

You mean: lists of (character) codes instead "of characters". E.g. `char_code(a, 0'a).` – false Sep 06 '14 at 18:57
You are right, sorry. I adjusted that text, but please feel free to make additional edits if necessary. – Ruud Helderman Sep 06 '14 at 19:18
There are some issues of steadfastness in `sgml_entity/2`. – false Sep 06 '14 at 19:23
I suppose I could make `sgml_entity/2` more efficient by closing each rule with a cut, but I guess you did not mean that. Even without cuts, `sgml_entity/2` seems deterministic. Could be my lack of experience with logic programming that I still fail to spot a lack of steadfastness here. Can you give a concrete example where the wrong output could force my predicates down the wrong path? – Ruud Helderman Sep 06 '14 at 20:27
It's not efficiency, but rather all these special cases. number_codes/2 accepts quite a complex syntax - [not necessarily the one you expect](http://www.complang.tuwien.ac.at/ulrich/iso-prolog/number_chars). – false Sep 06 '14 at 21:32
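To make the point about number_codes/2 concrete: it parses full Prolog number syntax, so text such as 0x1F or 1.5 between "&#" and ";" is accepted even though it is not a valid decimal character reference. A minimal sketch of a stricter check follows; decimal_codes_number/2 is a made-up helper name, not part of any library.

```prolog
% decimal_codes_number(+Codes, -N): succeeds only if Codes is a non-empty
% list of the ASCII digits 0-9; only then is number_codes/2 consulted.
decimal_codes_number(Codes, N) :-
    Codes = [_|_],
    forall(member(C, Codes), between(0'0, 0'9, C)),
    number_codes(N, Codes).
```

In SWI-Prolog, number_codes/2 turns the codes of 0x1F into 31, whereas decimal_codes_number/2 simply fails on that input and on the empty list.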

Hope you still improve your solution. I awarded the bounty anyway to avoid it being lost. – false Sep 06 '14 at 22:06
@CookieMonster: Sorry about that; my Prolog experience is dated, and apparently I overestimated the effect of 25 years of evolution on optimizing Prolog compilers. It puzzles me why Jan posted that solution on Google+ rather than here (either as an answer, or as part of the question if Jan is OP - to show others what he has tried already). But it's never too late for that. – Ruud Helderman Sep 07 '14 at 20:57
The problem is not compiler technology; it's rather that append/3 generates a little too much backtracking. For example, it will call the predicate sgml_entity/2 first with "", then with "&", then with "&l", then with "&lt", and then with "&lt;". If you used a DCG, for example, you could find the matching prefix directly by scanning the clauses of sgml_entity/2 only once. – Sep 07 '14 at 21:08
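The difference can be sketched as follows. With the first argument of append/3 unbound, every prefix of the input is enumerated and handed on for testing. With the first argument bound to the codes of a known entity, append/3 only checks that one prefix. The table entity_code/2 and the predicates entity_prefix/3 and all_prefixes/2 are hypothetical names used only for this illustration.

```prolog
% entity_code(?Name, ?Code): small illustrative table of named entities.
entity_code('&lt;',  0'<).
entity_code('&gt;',  0'>).
entity_code('&amp;', 0'&).

% entity_prefix(+Codes, -Code, -Rest): Codes starts with a known entity.
% Each table entry is tried once; no intermediate prefixes are generated.
entity_prefix(Codes, Code, Rest) :-
    entity_code(Name, Code),
    atom_codes(Name, NameCodes),
    append(NameCodes, Rest, Codes).

% all_prefixes(+Codes, -Prefixes): everything append(E1, _, Codes) enumerates.
all_prefixes(Codes, Prefixes) :-
    findall(Prefix, append(Prefix, _, Codes), Prefixes).
```

For the four-character input &lt;, all_prefixes/2 collects five candidate prefixes, from the empty list up to the full entity, which corresponds to the sequence of calls described in the comment.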

score 1 · Answer 3 · edited May 23 '17 at 12:03

If you don't mind linking a foreign module, then you can make a very efficient implementation in C.

html_unescape.pl:

:- module(html_unescape, [ html_unescape/2 ]).
:- use_foreign_library(foreign('./html_unescape.so')).

html_unescape.c:

#include <stdio.h>
#include <string.h>
#include <SWI-Prolog.h>

static int to_utf8(char **unesc, unsigned ccode)
{
    int ok = 1;
    if (ccode < 0x80)
    {
        *(*unesc)++ = ccode;
    }
    else if (ccode < 0x800)
    {
        *(*unesc)++ = 192 + ccode / 64;
        *(*unesc)++ = 128 + ccode % 64;
    }
    else if (ccode - 0xd800u < 0x800)
    {
        ok = 0;
    }
    else if (ccode < 0x10000)
    {
        *(*unesc)++ = 224 + ccode / 4096;
        *(*unesc)++ = 128 + ccode / 64 % 64;
        *(*unesc)++ = 128 + ccode % 64;
    }
    else if (ccode < 0x110000)
    {
        *(*unesc)++ = 240 + ccode / 262144;
        *(*unesc)++ = 128 + ccode / 4096 % 64;
        *(*unesc)++ = 128 + ccode / 64 % 64;
        *(*unesc)++ = 128 + ccode % 64;
    }
    else
    {
        ok = 0;
    }
    return ok;
}

static int numeric_entity(char **esc, char **unesc)
{
    int consumed = 0;  // stays 0 if sscanf stops before reaching %n
    unsigned ccode;
    int ok = (sscanf(*esc, "&#%u;%n", &ccode, &consumed) > 0 ||
              sscanf(*esc, "&#x%x;%n", &ccode, &consumed) > 0) &&
             consumed > 0 &&
             to_utf8(unesc, ccode);
    if (ok)
    {
        *esc += consumed;
    }
    return ok;
}

static int symbolic_entity(char **esc, char **unesc, const char *name, int ccode)
{
    int ok = strncmp(*esc, name, strlen(name)) == 0 &&
             to_utf8(unesc, ccode);
    if (ok)
    {
        *esc += strlen(name);
    }
    return ok;
}

static foreign_t pl_html_unescape(term_t escaped, term_t unescaped)
{
    char *esc;
    if (!PL_get_chars(escaped, &esc, CVT_ATOM | REP_UTF8))
    {
        PL_fail;
    }
    else if (strchr(esc, '&') == NULL)
    {
        return PL_unify(escaped, unescaped);
    }
    else
    {
        char buffer[strlen(esc) + 1];
        char *unesc = buffer;
        while (*esc != '\0')
        {
            if (*esc != '&' || !(numeric_entity(&esc, &unesc) ||
                                 symbolic_entity(&esc, &unesc, "&lt;", '<') ||
                                 symbolic_entity(&esc, &unesc, "&gt;", '>') ||
                                 symbolic_entity(&esc, &unesc, "&amp;", '&')))
                                    // TODO: more entities...
            {
                *unesc++ = *esc++;
            }
        }
        return PL_unify_chars(unescaped, PL_ATOM | REP_UTF8, unesc - buffer, buffer);
    }
}

install_t install_html_unescape()
{
    PL_register_foreign("html_unescape", 2, pl_html_unescape, 0);
}

The following command builds the shared library html_unescape.so from html_unescape.c. Tested on Ubuntu 14.04; it may differ on Windows.

swipl-ld -shared -o html_unescape html_unescape.c

Start up SWI-Prolog:

swipl html_unescape.pl

Sample output:

?- html_unescape('&lt;a&gt; &#26361;&#25805;', S).
S = '<a> 曹操'.

With special thanks to the SWI-Prolog documentation and source code, and to the question "C library to convert unicode code points to UTF8?".

So html_unescape/2 always does a full string copy, even if for example 'abc' is supplied to the predicate, i.e. if no entities occur? – Sep 07 '14 at 14:59
@CookieMonster: For readability, I deliberately kept the C implementation as simple as possible. There is plenty of room for optimization, but the question is whether you _should_. See Ward's Wiki on [premature optimization](http://c2.com/cgi/wiki?PrematureOptimization), [optimize later](http://c2.com/cgi/wiki?OptimizeLater) and [rules of optimization](http://c2.com/cgi/wiki?RulesOfOptimization). — Ruud Helderman, Sep 07 '14 at 20:07
I guess a solution should, since Prolog systems don't like too many atoms. For example, SWI-Prolog computes a murmur hash for the atom and looks it up in a dictionary, judging from the source (I am not an expert on SWI-Prolog). But of course this optimization is only optional; I was just wondering whether I overlooked something in your source. I also voted up the answer. – Sep 07 '14 at 21:38
@CookieMonster: You may well have a point; I just checked [the implementation of xml_quote_attribute](https://github.com/flavioc/yap/blob/59c4b452d6ba0564f4dae9ce3f21fb7bd7bace9b/packages/sgml/quote.c) and noticed [PL_unify](http://www.swi-prolog.org/pldoc/man?CAPI=PL_unify) being called specifically when no replacements are made. Jan Wielemaker probably didn't do that just for fun. I adjusted my code accordingly. Thanks for the suggestion and the upvote! — Ruud Helderman, Sep 07 '14 at 22:15

score 0 · Answer 4 · 2018-08-25T14:49:49.413

This does not aspire to be the ultimate answer, since it doesn't give a solution for SWI-Prolog. For a Java-based interpreter the problem is that XML escaping is not part of J2SE, at least not in a simple form (I didn't figure out how to use Xerces or the like).

A possible route would be to interface to StringEscapeUtils ( * ) from Apache Commons. But then again, this would not be necessary on Android, since there is a class TextUtils. So we rolled our own little conversion ( * * ). It works as follows:

?- text_escape('<abc>', X).
X = '&lt;abc&gt;'
?- text_escape(X, '&lt;abc&gt;').
X = '<abc>'

Note the use of the Java methods codePointAt() and charCount(), and correspondingly appendCodePoint(), in the Java source code. With these it could also escape and unescape code points above the Basic Multilingual Plane, i.e. in the range > 0xFFFF (currently not implemented, left as an exercise).

On the other hand, the Apache library, at least in version 2.6, is NOT surrogate-pair aware and will emit two decimal entities per such code point instead of one.
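For comparison, a two-way relation like the text_escape/2 queries above can be sketched in plain Prolog with one DCG that is run in either direction. This is not the Jekejeke implementation; it covers only <, > and &, and the names escaped//1, esc//1 and special/1 are invented for the example.

```prolog
% escaped(?Plain)// relates a list of plain codes to its escaped form.
escaped([])     --> [].
escaped([C|Cs]) --> esc(C), escaped(Cs).

% esc(?Code)// : special characters map to entities, all others to themselves.
esc(0'<) --> "&lt;".
esc(0'>) --> "&gt;".
esc(0'&) --> "&amp;".
esc(C)   --> [C], { \+ special(C) }.

special(0'<).
special(0'>).
special(0'&).

% text_escape(?Plain, ?Escaped): at least one argument must be an atom.
% The bound side decides whether the grammar generates or parses.
text_escape(Plain, Escaped) :-
    (   nonvar(Plain)
    ->  atom_codes(Plain, Ps),
        once(phrase(escaped(Ps), Es)),
        atom_codes(Escaped, Es)
    ;   atom_codes(Escaped, Es),
        once(phrase(escaped(Ps), Es)),
        atom_codes(Plain, Ps)
    ).
```

The sketch is deliberately strict: when unescaping, a bare & that does not start one of the three entities makes the call fail instead of being copied through.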

Bye

( * ) Java: Class StringEscapeUtils Source
http://grepcode.com/file/repo1.maven.org/maven2/commons-lang/commons-lang/2.6/org/apache/commons/lang/Entities.java#Entities.escape%28java.io.Writer,java.lang.String%29

( * * ) Jekejeke Prolog: Module xml
http://www.jekejeke.ch/idatab/doclet/prod/en/docs/05_run/10_docu/05_frequent/07_theories/20_system/03_xml.html
