How can I enable a word-breaking function by length without splitting inside HTML-encoded special chars?

Question

I would like to implement a function that inserts a word-breaking tag if a word is too long to fit on a single line.

    protected string InstertWBRTags(string text, int interval)
    {
        if (String.IsNullOrEmpty(text) || interval < 1 || text.Length < interval)
        {
            return text;
        }
        int pS = 0, pE = 0, tLength = text.Length;
        StringBuilder sb = new StringBuilder(tLength * 2);

        while (pS < tLength)
        {
            pE = pS + interval;
            if (pE > tLength)
                sb.Append(text.Substring(pS));
            else
            {
                sb.Append(text.Substring(pS, pE - pS));
                sb.Append("&#8203;"); // <wbr> not supported by IE 8
            }
            pS = pE;
        }
        return sb.ToString();
    }

The problem is: what can I do if the text contains HTML-encoded special chars? How can I prevent a tag from being inserted inside an entity such as `&szlig;`? And how can I count the real string length (as it appears in the browser)? A string like `&#9825;&#9829;` (♡♥) contains only 2 chars (hearts) in the browser, but its length is 14.

score 1 · Accepted Answer · edited May 23 '17 at 12:03

One solution would be to decode the entities into the Unicode characters they represent and work with those. To do that, use `System.Net.WebUtility.HtmlDecode()` if you're on .NET 4, or `System.Web.HttpUtility.HtmlDecode()` otherwise.

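But be aware that not every Unicode character fits in one `char`: characters outside the Basic Multilingual Plane are stored as a surrogate pair of two `char` values, so a break must not be inserted between the two halves.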
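To illustrate the surrogate-pair caveat, here is a minimal sketch in Java, whose `char` is also a UTF-16 code unit, so the same concern applies as in .NET. It assumes the text has already been decoded, and it counts code points instead of `char` values. The class and method names (`CodePointBreaker`, `insertBreaks`) are hypothetical, and unlike the code in the question it does not append a break after the final chunk.

```java
public class CodePointBreaker {

    // Hypothetical helper: inserts a zero-width space (U+200B) after every
    // `interval` code points of already-decoded text, never between the two
    // halves of a surrogate pair.
    static String insertBreaks(String decoded, int interval) {
        if (decoded == null || decoded.isEmpty() || interval < 1) {
            return decoded;
        }
        StringBuilder sb = new StringBuilder(decoded.length() * 2);
        int seen = 0; // code points emitted so far
        int i = 0;
        while (i < decoded.length()) {
            int cp = decoded.codePointAt(i);          // reads a full pair if present
            int next = i + Character.charCount(cp);   // advance by 1 or 2 chars
            sb.appendCodePoint(cp);
            seen++;
            if (seen % interval == 0 && next < decoded.length()) {
                sb.append('\u200B');
            }
            i = next;
        }
        return sb.toString();
    }
}
```

Note that a code point is still not always one visible character (combining marks, for example), so this is only one step closer to what the browser shows.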

answered Jul 21 '10 at 14:22

svick

The `HtmlEncode` and `HtmlDecode` methods aren't symmetrical; decoding will convert the entities into single characters, but encoding won't convert all of these characters back into entities. Also, if the source text contains characters such as `<` and entities such as `&lt;`, then there's no way of distinguishing those after decoding. – Niels van der Rest Jul 21 '10 at 14:31
I meant that he shouldn't use `HtmlDecode` at all. But that would require the output to be Unicode. – svick Jul 21 '10 at 15:16

Damian Leszczyński - Vash · Answer 2 · 2010-07-21T14:19:16.580

You need to pass through the whole text character by character. When you find an `&`, examine what comes next: if it is a `#`, it is almost certain that a run of digits follows up to the semicolon (you can check this as well). In that situation, move your iterator to the position of the nearest semicolon and increment the counter.

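In Java:

    int count = 0;

    for (int i = 0; i < text.length(); i++) {
        if (text.charAt(i) == '&') {
            // Jump to the ';' that closes the entity; the loop's i++ then steps past it.
            int end = text.indexOf(';', i);
            if (end != -1) { // a bare '&' with no ';' counts as a plain character
                i = end;
            }
        }
        count++; // one visible character: a plain char or a whole entity
    }

This is a very simplified version.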
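The same scan can be extended from counting to inserting the breaks. The following is only a sketch under a few assumptions: an entity is taken to be `&`, then letters, digits or `#`, then `;`; anything else starting with `&` is treated as a plain character; and surrogate pairs are not handled. The names `EntityAwareBreaker`, `entityEnd` and `insertBreaks` are made up for this example.

```java
public class EntityAwareBreaker {

    // Zero-width space entity, the same break marker used in the question.
    static final String BREAK = "&#8203;";

    // Returns the index just past the entity starting at `start`, or -1 if the
    // text there does not look like an entity (assumed shape: &name; or &#digits;).
    static int entityEnd(String text, int start) {
        int semi = text.indexOf(';', start);
        if (semi == -1 || semi == start + 1) {
            return -1;
        }
        for (int k = start + 1; k < semi; k++) {
            char c = text.charAt(k);
            if (!(Character.isLetterOrDigit(c) || c == '#')) {
                return -1;
            }
        }
        return semi + 1;
    }

    // Inserts BREAK after every `interval` visible characters, where a whole
    // entity counts as one visible character and is never split.
    static String insertBreaks(String text, int interval) {
        if (text == null || text.isEmpty() || interval < 1) {
            return text;
        }
        StringBuilder sb = new StringBuilder(text.length() * 2);
        int visible = 0;
        int i = 0;
        while (i < text.length()) {
            int end = i + 1; // default: a single plain character
            if (text.charAt(i) == '&') {
                int e = entityEnd(text, i);
                if (e != -1) {
                    end = e; // copy the whole entity as one unit
                }
            }
            sb.append(text, i, end);
            visible++;
            i = end;
            if (visible % interval == 0 && i < text.length()) {
                sb.append(BREAK);
            }
        }
        return sb.toString();
    }
}
```

For example, `insertBreaks("ab&szlig;cd&#9825;&#9829;", 3)` treats `&szlig;` as the third visible character and places the break after it rather than inside it.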
