36

I can't believe this question hasn't been asked before. I have a string that needs to be inserted into an HTML file but it may contain special HTML characters. I want to replace these with the appropriate HTML representation.

The code below works but is pretty verbose and ugly. Performance is not critical for my application but I guess there are scalability problems here also. How can I improve this? I guess this is a job for STL algorithms or some esoteric Boost function, but the code below is the best I can come up with myself.

void escape(std::string *data)
{
    std::string::size_type pos = 0;
    for (;;)
    {
        pos = data->find_first_of("\"&<>", pos);
        if (pos == std::string::npos) break;
        std::string replacement;
        switch ((*data)[pos])
        {
        case '\"': replacement = "&quot;"; break;   
        case '&':  replacement = "&amp;";  break;   
        case '<':  replacement = "&lt;";   break;   
        case '>':  replacement = "&gt;";   break;   
        default: ;
        }
        data->replace(pos, 1, replacement);
        pos += replacement.size();
    };
}
paperjam
  • 8,321
  • 12
  • 53
  • 79
  • 1
    Do you really need to replace quotes? I though they were valid XML (I'd replace \n and \r too). – Giovanni Funchal Apr 14 '11 at 15:13
  • Yes, that's a different question and a good one. Which characters need replacing? – paperjam Apr 14 '11 at 15:19
  • Indeed :-) check this http://stackoverflow.com/questions/1091945/where-can-i-get-a-list-of-the-xml-document-escape-characters – Giovanni Funchal Apr 14 '11 at 15:23
  • @Gionvanni: depends on context. If the string is being pasted into the middle of an attribute value, like `tag = " – Steve Jessop Apr 14 '11 at 16:25

9 Answers9

43

Instead of just replacing in the original string, you can do copying with on-the-fly replacement which avoids having to move characters in the string. This will have much better complexity and cache behavior, so I'd expect a huge improvement. Or you can use boost::spirit::xml encode or http://code.google.com/p/pugixml/.

void encode(std::string& data) {
    std::string buffer;
    buffer.reserve(data.size());
    for(size_t pos = 0; pos != data.size(); ++pos) {
        switch(data[pos]) {
            case '&':  buffer.append("&amp;");       break;
            case '\"': buffer.append("&quot;");      break;
            case '\'': buffer.append("&apos;");      break;
            case '<':  buffer.append("&lt;");        break;
            case '>':  buffer.append("&gt;");        break;
            default:   buffer.append(&data[pos], 1); break;
        }
    }
    data.swap(buffer);
}

EDIT: A small improvement can be achieved by using an heuristic to determine the size of the buffer. Replace the buffer.reserve line with data.size()*1.1 (10%) or something similar depending of how much replacements are expected.

Omer Mor
  • 5,216
  • 2
  • 34
  • 39
Giovanni Funchal
  • 8,934
  • 13
  • 61
  • 110
  • True, unless replacements are rare. Thanks for the links too but I don't think either is practical to use. – paperjam Apr 14 '11 at 15:16
  • 1
    I didn't know about CDATA but it looks like web browsers don't respect it in HTML. – paperjam Apr 14 '11 at 15:30
  • If it is HTML that you want to escape, it might be much harder than simple XML. You might need a heavyweight library if you want the encoded string to be protected against weird characters. Check this: http://site.icu-project.org/ – Giovanni Funchal Apr 14 '11 at 15:33
  • I'm using ASCII so surely there are only around 100 characters to consider and only a few of these might cause problems? – paperjam Apr 14 '11 at 15:37
  • If the input is 7 bit ascii, my function is pretty safe I think. Accents may be problematic. See this: http://www.w3.org/TR/html4/charset.html – Giovanni Funchal Apr 14 '11 at 15:44
  • Word of caution: the `boost::spirit::xml encode` implementation is a disaster; not only is it potentially O(N^2) from a performance standpoint, but it also fais to escape quotes (of any kind), and escapes newline/CR as `\n`/`\r` instead of ` `/` `. See https://github.com/boostorg/spirit/blob/master/include/boost/spirit/home/classic/tree/impl/tree_to_xml.ipp – vladr Mar 18 '14 at 19:22
  • @GiovanniFunchal -- I realize this is an old answered question, but what about character entities like hhhh (hexidecimal XML character entity) and dddd (decimal XML character entity)? Should they be ignored and passed through untouched into the XML? – SMGreenfield Apr 01 '18 at 20:01
  • @SMGreenfield if your objective is escaping (as per the original question), then I believe this related question answers it https://stackoverflow.com/questions/1091945/what-characters-do-i-need-to-escape-in-xml-documents – Giovanni Funchal Apr 03 '18 at 08:41
6
void escape(std::string *data)
{
    using boost::algorithm::replace_all;
    replace_all(*data, "&",  "&amp;");
    replace_all(*data, "\"", "&quot;");
    replace_all(*data, "\'", "&apos;");
    replace_all(*data, "<",  "&lt;");
    replace_all(*data, ">",  "&gt;");
}

Could win the prize for least verbose?

paperjam
  • 8,321
  • 12
  • 53
  • 79
  • 4
    Care for the order, you should start with "&" :-) – Giovanni Funchal Apr 14 '11 at 15:14
  • 1
    If you just want to get the job done, definitely the most robust way to go. Encoding ASCII text HTML, though, it should be enough to quote &, < and >. You don't need to quote quotation marks if the text is not going into a node attribute. – Antti Huima Apr 14 '11 at 15:49
  • 6
    Your implementation should also win the prize for worst performing, as it will encode an N-long string of ampersands/quotes/etc. in O(N^2) complexity. – vladr Mar 18 '14 at 19:25
3

Here is a simple ~30 line C program that does the trick in a rather good manner. Here I am assuming that the temp_str will have allocated memory enough to have the additional escaped characters.

void toExpatEscape(char *temp_str)
{
    const char cEscapeChars[6]={'&','\'','\"','>','<','\0'};
    const char * const pEscapedSeqTable[] =
    {
        "&amp;",
        "&apos;",
        "&quot;",
        "&gt;",
        "&lt;",
    };
    unsigned int i, j, k, nRef = 0, nEscapeCharsLen = strlen(cEscapeChars), str_len = strlen(temp_str);
    int nShifts = 0; 

    for (i=0; i<str_len; i++)
    {
        for(nRef=0; nRef<nEscapeCharsLen; nRef++)
        {
            if(temp_str[i] == cEscapeChars[nRef])
            {
                if((nShifts = strlen(pEscapedSeqTable[nRef]) - 1) > 0)
                {
                    memmove(temp_str+i+nShifts, temp_str+i, str_len-i+nShifts); 
                    for(j=i,k=0; j<=i+nShifts,k<=nShifts; j++,k++)
                        temp_str[j] = pEscapedSeqTable[nRef][k];
                    str_len += nShifts;
                }
            }
        }  
    }
    temp_str[str_len] = '\0';
}
Aswath
  • 61
  • 2
3

My tests showed this answer gave the best performance from offered (not surprising it has the most rate).
I've implemented same algorithm for my project (I really want good performance & memory usage) - my tests showed my implementation has ~2.6-3.25 better speed performace. Also I don't like previous best offered algorithm bcs of bad memory usage - you will have extra memory usage as when apply 1.1 multiplier 'heuristic', as when .append() lead to resize.
So, leave my code here - maybe somebody find it useful.

HtmlPreprocess.h:

#ifndef _HTML_PREPROCESS_H_
#define _HTML_PREPROCESS_H_

#include <string>

class HtmlPreprocess
{
public:
    HtmlPreprocess();
    ~HtmlPreprocess();

    static void htmlspecialchars(
        const std::string & in,
        std::string & out
        );
};

#endif // _HTML_PREPROCESS_H_

HtmlPreprocess.cpp:

#include "HtmlPreprocess.h"


HtmlPreprocess::HtmlPreprocess()
{
}


HtmlPreprocess::~HtmlPreprocess()
{
}


const unsigned char map_char_to_final_size[] = 
{
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   6,   1,   1,   1,   5,   6,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   4,   1,   4,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1
};


const unsigned char map_char_to_index[] = 
{
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   2,      0xFF,   0xFF,   0xFF,   0,      1,      0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   4,      0xFF,   3,      0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF
};


void HtmlPreprocess::htmlspecialchars(
    const std::string & in,
    std::string & out
    )
{
    const char * lp_in_stored = &in[0];
    size_t in_size = in.size();

    const char * lp_in = lp_in_stored;
    size_t final_size = 0;
    for (size_t i = 0; i < in_size; i++)
        final_size += map_char_to_final_size[*lp_in++];

    out.resize(final_size);

    lp_in = lp_in_stored;
    char * lp_out = &out[0];

    for (size_t i = 0; i < in_size; i++)
    {
        char current_char = *lp_in++;
        unsigned char next_action = map_char_to_index[current_char];

        switch (next_action){
        case 0:
            *lp_out++ = '&';
            *lp_out++ = 'a';
            *lp_out++ = 'm';
            *lp_out++ = 'p';
            *lp_out++ = ';';
            break;
        case 1:
            *lp_out++ = '&';
            *lp_out++ = 'a';
            *lp_out++ = 'p';
            *lp_out++ = 'o';
            *lp_out++ = 's';
            *lp_out++ = ';';
            break;
        case 2:
            *lp_out++ = '&';
            *lp_out++ = 'q';
            *lp_out++ = 'u';
            *lp_out++ = 'o';
            *lp_out++ = 't';
            *lp_out++ = ';';
            break;
        case 3:
            *lp_out++ = '&';
            *lp_out++ = 'g';
            *lp_out++ = 't';
            *lp_out++ = ';';
            break;
        case 4:
            *lp_out++ = '&';
            *lp_out++ = 'l';
            *lp_out++ = 't';
            *lp_out++ = ';';
            break;
        default:
            *lp_out++ = current_char;
        }
    }
}
HostageBrain
  • 224
  • 2
  • 7
  • When you use a lookup table you should use an unsigned parameter, e.g. map_char_to_index[static_cast(current_char)] and not map_char_to_index[current_char]. – PatrickF Aug 08 '17 at 10:44
  • Overweight design and syntax. Don't go this way, it's too 90's C-stylish. – Sandburg Oct 14 '21 at 12:31
2

You can use the boost::property_tree::xml_parser::encode_char_entities if you don't want to write it yourself.

For reference, here's the code in boost 1.64.0:

```

template<class Str>
Str encode_char_entities(const Str &s)
{
    // Don't do anything for empty strings.
    if(s.empty()) return s;

    typedef typename Str::value_type Ch;

    Str r;
    // To properly round-trip spaces and not uglify the XML beyond
    // recognition, we have to encode them IF the text contains only spaces.
    Str sp(1, Ch(' '));
    if(s.find_first_not_of(sp) == Str::npos) {
        // The first will suffice.
        r = detail::widen<Str>("&#32;");
        r += Str(s.size() - 1, Ch(' '));
    } else {
        typename Str::const_iterator end = s.end();
        for (typename Str::const_iterator it = s.begin(); it != end; ++it)
        {
            switch (*it)
            {
                case Ch('<'): r += detail::widen<Str>("&lt;"); break;
                case Ch('>'): r += detail::widen<Str>("&gt;"); break;
                case Ch('&'): r += detail::widen<Str>("&amp;"); break;
                case Ch('"'): r += detail::widen<Str>("&quot;"); break;
                case Ch('\''): r += detail::widen<Str>("&apos;"); break;
                default: r += *it; break;
            }
        }
    }
    return r;
}

```

topkara
  • 886
  • 9
  • 15
2

I profiled 3 solutions with Visual Studio 2017. Input were 10 000 000 strings of size 5-20 with a probability of 9,4% that a char needs to be escaped.

  1. Solution from Giovanni Funchal
  2. Solution from HostageBrain
  3. Solution is mine

The result:

  1. needs 1.675 seconds
  2. needs 0.769 seconds
  3. needs 0.368 seconds

In mine Solution, the final size is precalculated and a copy of string data is done, only when needed. So the heap memory allocations should be minimal.

const unsigned char calcFinalSize[] =
{
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   6,   1,   1,   1,   5,   6,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   4,   1,   4,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1
};

void escapeXml(std::string & in)
{
    const char* dataIn = in.data();
    size_t sizeIn = in.size();

    const char* dataInCurrent = dataIn;
    const char* dataInEnd = dataIn + sizeIn;
    size_t outSize = 0;
    while (dataInCurrent < dataInEnd)
    {
        outSize += calcFinalSize[static_cast<uint8_t>(*dataInCurrent)];
        dataInCurrent++;
    }


    if (outSize == sizeIn)
    {
        return;
    }
    std::string out;
    out.resize(outSize);

    dataInCurrent = dataIn;
    char* dataOut = &out[0];
    while (dataInCurrent < dataInEnd)
    {
        switch (*dataInCurrent) {
        case '&':
            memcpy(dataOut, "&amp;", sizeof("&amp;") - 1);
            dataOut += sizeof("&amp;") - 1;
            break;
        case '\'':
            memcpy(dataOut, "&apos;", sizeof("&apos;") - 1);
            dataOut += sizeof("&apos;") - 1;
            break;
        case '\"':
            memcpy(dataOut, "&quot;", sizeof("&quot;") - 1);
            dataOut += sizeof("&quot;") - 1;
            break;
        case '>':
            memcpy(dataOut, "&gt;", sizeof("&gt;") - 1);
            dataOut += sizeof("&gt;") - 1;
            break;
        case '<':
            memcpy(dataOut, "&lt;", sizeof("&lt;") - 1);
            dataOut += sizeof("&lt;") - 1;
            break;
        default:
            *dataOut++ = *dataInCurrent;
        }
        dataInCurrent++;
    }
    in.swap(out);
}

Edit: Replaced "&quote;" with "&quot;". Old solution was overwriting memory, because the look-up table contained a length of 6 for "&quote;".

Community
  • 1
  • 1
PatrickF
  • 594
  • 2
  • 11
2

If you're going for processing speed, then it seems to me that the best would be to have a second string that you build as you go, copying from the first string to the second string, and then appending the html escapes as you encounter them. Since I assume that the replace method involves first a memory move, followed by a copy into the replaced position, it's going to be very slow for large strings. If you have a second string to build using .append(), it will avoid the memory move.

As far was code "cleanness", I think that's about as pretty as you're going to get. You could create an array of characters and their replacements, and then search the array, but that would probably be slower and not much cleaner anyway.

Anthony Johnson
  • 636
  • 8
  • 14
2

I'd honestly go with a more generic version using iterators, such that you can "stream" the encoding. Consider the following implementation:

#include <algorithm>

namespace xml {

    // Helper for null-terminated ASCII strings (no end of string iterator).
    template<typename InIter, typename OutIter>
    OutIter copy_asciiz ( InIter begin, OutIter out )
    {
        while ( *begin != '\0' ) {
            *out++ = *begin++;
        }
        return (out);
    }

    // XML escaping in it's general form.  Note that 'out' is expected
    // to an "infinite" sequence.
    template<typename InIter, typename OutIter>
    OutIter escape ( InIter begin, InIter end, OutIter out )
    {
        static const char bad[] = "&<>";
        static const char* rep[] = {"&amp;", "&lt;", "&gt;"};
        static const std::size_t n = sizeof(bad)/sizeof(bad[0]);

        for ( ; (begin != end); ++begin )
        {
            // Find which replacement to use.
            const std::size_t i =
                std::distance(bad, std::find(bad, bad+n, *begin));

            // No need for escaping.
            if ( i == n ) {
                *out++ = *begin;
            }
            // Escape the character.
            else {
                out = copy_asciiz(rep[i], out);
            }
        }
        return (out);
    }

}

Then, you can simplify the average case using a few overloads:

#include <iterator>
#include <string>

namespace xml {

    // Get escaped version of "content".
    std::string escape ( const std::string& content )
    {
        std::string result;
        result.reserve(content.size());
        escape(content.begin(), content.end(), std::back_inserter(result));
        return (result);
    }

    // Escape data on the fly, using "constant" memory.
    void escape ( std::istream& in, std::ostream& out )
    {
        escape(std::istreambuf_iterator<char>(in),
            std::istreambuf_iterator<char>(),
            std::ostreambuf_iterator<char>(out));
    }

}

Finally, test the whole lot:

#include <iostream>

int main ( int, char ** )
{
    std::cout << xml::escape("<foo>bar & qux</foo>") << std::endl;
}
André Caron
  • 44,541
  • 12
  • 67
  • 125
  • This seems slick, and like it should work. But I get a compiler error trying to use it: std::cout << xml2::escape("bar & qux") << std::endl; error C2780: 'OutIter xml2::escape(InIter,InIter,OutIter)' : expects 3 arguments - 1 provided (Using Visual Studio 2012 Pro) – Darrin Aug 13 '15 at 05:23
-2

Or with just stl :

 std::string& rep(std::string &s, std::string from, std::string to)
    {
      int pos = -1;
      while ( (pos = s.find(from, pos+1) ) != string::npos)
        s.erase(pos, from.length()).insert(pos, to);

      return s;
    }

Usage:

rep(s, "&", "&quot;");
rep(s, "\"", "&quot;");

or:

rep(s, "HTML","xxxx");
etipici
  • 415
  • 3
  • 6
  • The `from` and `to` strings are being passed by copy. Use a `const&` instead. Also, chaining `erase` and `insert` in a loop that executes `find` is very bad for performance because `strings` are contiguous arrays. You are potentially at O(n^3). – Giovanni Funchal Apr 19 '11 at 07:02