Here I've found a great way to HTML-encode/escape special characters. Now I wonder how to unescape HTML-encoded text in C++.

The code is:

#include <algorithm>
#include <cstddef>
#include <iterator>

namespace xml {

    // Helper for null-terminated ASCII strings (no end of string iterator).
    template<typename InIter, typename OutIter>
    OutIter copy_asciiz ( InIter begin, OutIter out )
    {
        while ( *begin != '\0' ) {
            *out++ = *begin++;
        }
        return (out);
    }

    // XML escaping in its general form.  Note that 'out' is expected
    // to be an "infinite" sequence.
    template<typename InIter, typename OutIter>
    OutIter escape ( InIter begin, InIter end, OutIter out )
    {
        static const char bad[] = "&<>";
        static const char* rep[] = {"&amp;", "&lt;", "&gt;"};
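        // Count the replacements rather than sizeof(bad): the latter includes
        // the terminating '\0', so a NUL in the input would index rep[3].
        static const std::size_t n = sizeof(rep)/sizeof(rep[0]);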

        for ( ; (begin != end); ++begin )
        {
            // Find which replacement to use.
            const std::size_t i =
                std::distance(bad, std::find(bad, bad+n, *begin));

            // No need for escaping.
            if ( i == n ) {
                *out++ = *begin;
            }
            // Escape the character.
            else {
                out = copy_asciiz(rep[i], out);
            }
        }
        return (out);
    }

}

and

#include <istream>
#include <iterator>
#include <ostream>
#include <string>

namespace xml {

    // Get escaped version of "content".
    std::string escape ( const std::string& content )
    {
        std::string result;
        result.reserve(content.size());
        escape(content.begin(), content.end(), std::back_inserter(result));
        return (result);
    }

    // Escape data on the fly, using "constant" memory.
    void escape ( std::istream& in, std::ostream& out )
    {
        escape(std::istreambuf_iterator<char>(in),
            std::istreambuf_iterator<char>(),
            std::ostreambuf_iterator<char>(out));
    }

}

It's a great piece of code, and it works for:

#include <iostream>

int main ( int, char ** )
{
    std::cout << xml::escape("<foo>bar & qux</foo>") << std::endl;
}

So I wonder: how can I implement HTML unescaping in the same manner?

Rella
  • Uhm, not to be demeaning or anything, but why don't you just reverse what you did with the `escape()` method? – foxy Nov 02 '11 at 06:18
  • possible duplicate of [How can I decode HTML entities in C++?](http://stackoverflow.com/questions/2078520/how-can-i-decode-html-entities-in-c) – Rob Kennedy Nov 02 '11 at 06:19
  • I wrote [`unescape_xml_entities(InputIt, InputIt, OutputIt)`](https://gist.github.com/bb3ba230bdf679324292) to find out how it would look without regexes, parser generators, string `replace()`, etc. – jfs Nov 03 '11 at 06:30
  • @freedompeace: "just reverse" requires keeping track of consumed input (the arguments are `InputIterator`s, i.e., a one-pass algorithm is required). See [the code I've linked above](https://gist.github.com/bb3ba230bdf679324292). – jfs Nov 03 '11 at 06:49
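
The one-pass constraint raised in the last comment can be illustrated with a minimal sketch. This is not the code from the linked gist; `xml::unescape` and its internals are hypothetical names, and only `&amp;`, `&lt;` and `&gt;` are handled. Because input iterators cannot be rewound, the characters read after an `&` are kept in a small buffer so they can be written out unchanged when they turn out not to form a known entity.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>

namespace xml {

    // Hypothetical counterpart to escape(): a single pass over input
    // iterators, buffering at most one candidate entity name.
    template<typename InIter, typename OutIter>
    OutIter unescape ( InIter begin, InIter end, OutIter out )
    {
        static const char* const names[] = {"amp", "lt", "gt"};
        static const char chars[] = {'&', '<', '>'};
        static const std::size_t n = sizeof(names)/sizeof(names[0]);
        static const std::size_t max_len = 3; // length of the longest name

        while ( begin != end )
        {
            const char c = *begin;
            ++begin;
            if ( c != '&' ) {
                *out++ = c;
                continue;
            }

            // Consume a candidate name, remembering what was read.
            std::string pending;
            bool terminated = false;
            while ( begin != end )
            {
                const char d = *begin;
                if ( d == '&' ) break;          // starts a new candidate
                if ( d == ';' ) { ++begin; terminated = true; break; }
                if ( pending.size() == max_len ) break; // too long to match
                ++begin;
                pending += d;
            }

            // Look the candidate up among the known names.
            std::size_t i = 0;
            while ( i != n && pending != names[i] ) ++i;

            if ( terminated && i != n ) {
                *out++ = chars[i];
            }
            else {
                // Not an entity: emit exactly what was consumed.
                *out++ = '&';
                out = std::copy(pending.begin(), pending.end(), out);
                if ( terminated ) *out++ = ';';
            }
        }
        return (out);
    }

}
```

It can be called the same way as the iterator version of `escape()`, for example with `std::back_inserter(result)` or a pair of `std::istreambuf_iterator<char>`. Unknown entities such as `&copy;` pass through untouched.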

1 Answer

Take a look at how I've solved a similar problem for '&#(\d+);' strings i.e., numeric character references (NCRs) using boost::spirit, boost::regex_token_iterator, Flex, Perl.

In your case the regex is `&(amp|lt|gt);`, provided you don't need to convert all HTML entities.
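
As a rough illustration of that regex, here is a minimal sketch that uses `std::regex` from C++11 instead of the Boost, Flex, or Perl variants mentioned above. That substitution is an assumption about the available toolchain, and `xml::unescape` is a hypothetical name that does not appear in the question's code. Each match is replaced by its character, and the text between matches is copied verbatim.

```cpp
#include <regex>
#include <string>

namespace xml {

    // Hypothetical helper: undo escape() for "&amp;", "&lt;" and "&gt;" only.
    std::string unescape ( const std::string& content )
    {
        static const std::regex entity("&(amp|lt|gt);");

        std::string result;
        result.reserve(content.size());

        std::string::const_iterator pos = content.begin();
        const std::sregex_iterator last;
        for ( std::sregex_iterator it(content.begin(), content.end(), entity);
              it != last; ++it )
        {
            const std::smatch& m = *it;

            // Copy the text between the previous match and this one.
            result.append(pos, m[0].first);

            // Map the captured name back to its character.
            const std::string name = m[1].str();
            result += (name == "amp") ? '&' : (name == "lt") ? '<' : '>';

            pos = m[0].second;
        }

        // Copy whatever follows the last match.
        result.append(pos, content.end());
        return (result);
    }

}
```

Since the matches are taken from the original input only, `&amp;lt;` becomes `&lt;` and is not unescaped a second time.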

jfs