How do I find the string index of a tag (an element) without counting expanded entities?

Question

I've got a large piece of text which I want to be able to select, storing the selected part by its startindex and endindex. (For example, selecting "or" in "word" would give me startindex 1 and endindex 2.)

This all works properly, but I've got a problem with HTML entities such as &amp; (the ampersand).

I've created a small test case that reproduces the issue. You can see in the fiddle below that the startIndex inflates if you select anything beyond the &, because the & isn't counted as a single character but as 5 characters: &amp;.

Is there a way to make it count special characters like the ampersand properly, without screwing up the index?

http://jsfiddle.net/Eqct4/

JavaScript

$(document).ready(function() {
    $('#textBlock').mouseup(function() {
        var selectionRange = window.getSelection();
        if (!selectionRange.isCollapsed) {
            selectedText = selectionRange.getRangeAt(0).toString();
        }

        document.getElementById('textBlock').setAttribute('contenteditable', true);
        document.execCommand('strikethrough', false);
        var startIndex = $('#textBlock').html().indexOf('<strike>');
        $('#startindex').html('the startindex is: ' + startIndex);
        done();
    });
});

function done() {
    document.getElementById('textBlock').setAttribute('contenteditable', false);
    document.getSelection().removeAllRanges();
    removeStrikeFromElement($('#textBlock'));
}

function removeStrikeFromElement (element) {
    element.find('strike').each(function() {
        jQuery(this).replaceWith(removeStrikeFromElement(jQuery(this)));
    });
    return element.html();
}

I think/know it has to do with $('#textBlock').html() being used for the indexOf instead of text(). The best way I found to get a start and endindex was to <strike> through the selected text, since execCommand lets me do that and it's an HTML tag never used in the application.

Just a little advice, you're using the .Net of all JavaScript libraries. Get used to the language some. For example: the line `document.getElementById('textBlock').setAttribute('contenteditable', true);` can easily be shortened with `$("#textBlock").attr("contenteditable", true);` — SpYk3HH, May 03 '13 at 12:55
You can make a small function that adds `html()` to a div. The same function takes the `text()` from that div and returns it. edit: http://stackoverflow.com/questions/1147359/how-to-decode-html-entities-using-jquery — Tim Vermaelen, May 03 '13 at 13:02
Also, you might want to look at [Rangy](https://code.google.com/p/rangy/) — SpYk3HH, May 03 '13 at 13:04
@TimVermaelen thanks. I'd like to keep the HTML that is there though. I just don't want the count to be messed up by characters that are 1 character long in the TEXT and 5 long in the HTML behind it... — CaptainCarl, May 03 '13 at 13:07

score 3 · Answer 1 · answered May 03 '13 at 13:11

If you really want to keep your code and modify it only a little, you could replace the escaped special characters with their visible equivalents while keeping the HTML tags. Change your declaration of startIndex to this:

// Decode &amp; last, so that escaped text such as &amp;quot; is not decoded twice.
var startIndex = $('#textBlock').html().replace(/&quot;/g, "\"").replace(/&amp;/g, "&").indexOf('<strike>');

You can chain further replace() calls for any other special characters that you want counted as normal characters rather than as their HTML-escaped form. In my example I replaced the & and the " characters.
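As a rough sketch of how the chained replace() calls could be generalized: a single pass over a small lookup table decodes each listed entity exactly once, so the order of the replacements no longer matters. The names ENTITY_CHARS and decodeSelectedEntities are made up for this example, and the table is only a starting point. &lt; and &gt; are left out on purpose, because decoding them could turn escaped text into something that looks like a real tag.

```javascript
// Hypothetical lookup table: the entities that should count as one character.
// Keep it in sync with the alternation in the regular expression below.
var ENTITY_CHARS = { '&amp;': '&', '&quot;': '"', '&nbsp;': '\u00a0' };

// Decodes only the listed entities, in one pass, leaving all tags untouched.
function decodeSelectedEntities(markup) {
    return markup.replace(/&(?:amp|quot|nbsp);/g, function (entity) {
        return ENTITY_CHARS[entity];
    });
}

// The kind of string html() returns for the example in the question.
var html = 'This is a cool test &amp; <strike>stuff like</strike> that';
var startIndex = decodeSelectedEntities(html).indexOf('<strike>');
console.log(startIndex); // 22, versus 26 when searching the raw string
```

In the question's code this would be used as decodeSelectedEntities($('#textBlock').html()).indexOf('<strike>').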

There are more optimizations possible in your code; this is just a simple way to fix your problem.

Hope this helps a bit. See the forked fiddle here: http://jsfiddle.net/vQNyv/

answered by retanik

That seems to do the trick, yes. However, how can I treat true HTML as HTML in the count? If you look at my updated fiddle (http://jsfiddle.net/Eqct4/2/) there's a &cent; added. That is counted as 1 character without replacing. So a: why does it only count & as &amp; and not ¢ as &cent;? And b: how can I make it so that special chars are filtered (as in your example) but true HTML entities are not? – CaptainCarl May 03 '13 at 13:51
@CaptainCarl - It doesn't count ¢ because ¢ has no meaning in HTML, while & _does_. That is, it's not trying to expand all expandable symbols—it's trying to expand the symbols that might be interpreted as something special in HTML. – Andrew Cheong May 03 '13 at 14:08
I see. But how can I make a one-way deal out of this then? I want to use a startindex including all HTML tags or none at all. In this case it would be a little bit of both. And obviously that isn't good for a solid start- and endindex... – CaptainCarl May 03 '13 at 14:09
@CaptainCarl - I _think_ I know what you mean. But you still have to use the DOM. It's the only way to distinguish text nodes and entity nodes (types 3, 4, and 5) from actual element nodes (type 1). See the edit at the bottom of my answer. Comment on my answer from here on, so we stop bothering poor @retanik. – Andrew Cheong May 03 '13 at 14:23

Andrew Cheong · Answer 2 · 2013-05-03T21:15:49.720

The Problem

Using html() returns:

This is a cool test &amp; <strike>stuff like</strike> that

Using text(), however, would return:

This is a cool test & stuff like that

So, html() is necessary in order to see the string, <strike>, but then of course all special entities are escaped, which they should be. There are ways to hack around this problem, but imagine what would happen if, say, the text was describing HTML itself:

Use the <strike></strike> tags to strike out text.

In this case, you want the interpretation,

Use the &lt;strike&gt;&lt;/strike&gt; tags to strike out text.

That's why the only correct way to approach this would be to iterate through DOM nodes.
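To make that failure mode concrete, here is a small sketch in plain JavaScript (no DOM needed). The markup string and the helper name naiveIndexOfTag are invented for illustration. The point is only that decoding entities in the string and then searching can match escaped text instead of the real element.

```javascript
// Hypothetical markup, as html() might return it: the first "strike" is
// escaped text, the second one is a real element.
var html = 'Use the &lt;strike&gt; tag &amp; <strike>like this</strike>';

// Naive approach: decode entities in the string, then search for the tag.
function naiveIndexOfTag(markup, tag) {
    var decoded = markup
        .replace(/&lt;/g, '<')
        .replace(/&gt;/g, '>')
        .replace(/&amp;/g, '&'); // &amp; last, to avoid double decoding
    return decoded.indexOf('<' + tag + '>');
}

var naiveIndex = naiveIndexOfTag(html, 'strike');

// The real element starts after the visible text "Use the <strike> tag & ".
var expectedIndex = 'Use the <strike> tag & '.length;

console.log(naiveIndex, expectedIndex); // 8 23
```

The naive search stops at the escaped text, 15 characters too early, and nothing in the decoded string tells the two occurrences apart.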

The jQuery/DOM Solution

Here's a jsFiddle of my solution, and here's the code:

// Returns the number of text characters that precede the first <tag>
// element inside the first matched element, or -1 if there is no such tag.
jQuery.fn.indexOfTag = function (tag) {
    if (!this.length) {
        return -1; // empty jQuery set: nothing to search
    }
    var nodes = this[0].childNodes;
    var chars = 0;
    for (var i = 0; i < nodes.length; i++) {
        var node = nodes[i];
        var type = node.nodeType;
        // 3 = text, 4 = CDATA section, 5 = entity reference
        if (type === 3 || type === 4 || type === 5) {
            // alert('advancing ' + node.nodeValue.length + ' chars');
            chars += node.nodeValue.length;
        } else if (type === 1) {
            if (node.tagName === tag.toUpperCase()) {
                // alert('found <' + node.tagName + '> at ' + chars + ', returning');
                return chars;
            } else {
                // alert('found <' + node.tagName + '>, recursing');
                var subIndexOfTag = $(node).indexOfTag(tag);
                if (subIndexOfTag === -1) {
                    // alert('did not find <' + tag.toUpperCase() + '> in <' + node.tagName + '>');
                    chars += $(node).text().length;
                } else {
                    // alert('found <' + tag.toUpperCase() + '> in <' + node.tagName + '>');
                    chars += subIndexOfTag;
                    return chars;
                }
            }
        }
    }
    return -1;
};

Uncomment the alert()s to gain insight into what's going on. Here's a reference on the nodeTypes.
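For anyone who wants to try the idea without jQuery or a browser, here is a rough sketch of the same walk as a plain function. It only reads nodeType, nodeValue, tagName and childNodes, so it runs on real DOM nodes as well as on hand-built objects. The names textIndexOfTag, fakeText and fakeElement are made up for this example and are not part of the plugin above.

```javascript
// Counts the text characters that precede the first <tag> element inside
// root. Returns -1 if root contains no such element.
function textIndexOfTag(root, tag) {
    var wanted = tag.toUpperCase();
    var chars = 0;
    var found = false;

    function walk(node) {
        var children = node.childNodes || [];
        for (var i = 0; i < children.length && !found; i++) {
            var child = children[i];
            if (child.nodeType === 3 || child.nodeType === 4) {
                chars += child.nodeValue.length; // text or CDATA section
            } else if (child.nodeType === 1) {
                if (child.tagName === wanted) {
                    found = true;
                } else {
                    walk(child); // keep counting inside other elements
                }
            }
        }
    }

    walk(root);
    return found ? chars : -1;
}

// Hypothetical stand-ins for DOM nodes, so the sketch runs outside a browser.
function fakeText(value) {
    return { nodeType: 3, nodeValue: value };
}
function fakeElement(tagName, childNodes) {
    return { nodeType: 1, tagName: tagName, childNodes: childNodes };
}

// Mirrors: This is a cool test &amp; <strike>stuff like</strike> that
var block = fakeElement('DIV', [
    fakeText('This is a cool test & '),
    fakeElement('STRIKE', [fakeText('stuff like')]),
    fakeText(' that')
]);

console.log(textIndexOfTag(block, 'strike')); // 22
```

In a page this would be called as textIndexOfTag(document.getElementById('textBlock'), 'strike'). The ampersand counts once because the text node holds the decoded character, not the entity.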

The jQuery/DOM Solution counting outerHTML

Based on your comments, I think you're saying you do want to count HTML tags (character-wise), but just not the HTML entities. Here's a new jsFiddle of the function itself, and here's a new jsFiddle of it applied to your problem.
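To illustrate what "count the tags, but not the entities" could mean, here is a string-based sketch. It is not the code from the fiddles, just one possible approach under stated assumptions: the input is markup as serialized by html(), where a literal < in text is always escaped, and the tag is written without attributes (as the <strike> produced by execCommand is). The name markupIndexOfTag is hypothetical.

```javascript
// Index of <tag> within the markup, counting every tag character but
// counting each entity (named, decimal or hex) as a single character.
function markupIndexOfTag(markup, tag) {
    var rawIndex = markup.indexOf('<' + tag + '>');
    if (rawIndex === -1) {
        return -1;
    }
    // Collapse each entity before the tag to a one-character placeholder.
    var prefix = markup.slice(0, rawIndex);
    var entity = /&(?:#\d+|#x[0-9a-f]+|[a-z][a-z0-9]*);/gi;
    return prefix.replace(entity, '_').length;
}

var html = 'A &amp; <b>B</b> &lt; <strike>C</strike>';
console.log(html.indexOf('<strike>'));        // 22 (entities counted in full)
console.log(markupIndexOfTag(html, 'strike')); // 15 (entities counted once)
```

Here the <b> and </b> tags still contribute their 7 characters, while &amp; and &lt; contribute 1 each.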

Thanks. Works pretty neat as well... I made a comment on the post above yours which shows the issue of .text() vs .html(); perhaps you've got some new insight into it... – CaptainCarl, May 03 '13 at 14:06
@CaptainCarl - I updated [the last jsFiddle](http://jsfiddle.net/acheong87/ft7Y5/); it should now do what I think you're trying to do. To get the ending index, simply add `$('#textBlock').find('strike')[0].outerHTML.length` to the starting index. Happy coding! – Andrew Cheong, May 03 '13 at 21:10
