How to find a substring only in the text portion of an HTML string, with Javascript?

Question

UPDATE: I am no longer specifically in need of the answer to this question - I was able to solve the (larger) problem I had in an entirely different way (see my comment). However, I'll check in occasionally, and if a viable answer arrives, I'll accept it. (It may take a week or three, though, as I'm only here sporadically.)

I have a string. It may or may not have HTML tags in it. So, it could be:

'This is my unspanned string'

or it could be:

'<span class="someclass">This is my spanned string</span>'

or:

'<span class="no-text"></span><span class="some-class"><span class="other-class">This is my spanned string</span></span>'

or:

'<span class="no-text"><span class="silly-example"></span></span><span class="some-class">This is my spanned string</span>'

I want to find the index of a substring, but only in the portion of the string that would become text node(s) if the string were turned into a DOM element. In the examples above, that means searching only the plain text This is my spanned string (or This is my unspanned string in the first one).

However, I need the location of the substring in the whole string, not only in the plain text portion.

So, if I'm searching for "span" in each of the strings above:

searching the first one will return 13 (0-based),
searching the second will skip the opening span tag in the string and return 35 for the string span in the word spanned
searching the third will skip the empty span tag and the openings of the two nested span tags, and return 91
searching the fourth will skip the nested span tags and the opening of the second span tag, and return 100
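
As a cross-check of these numbers: each expected index is the length of the markup that precedes the text, plus the offset of the match within the text. The sketch below only redoes that arithmetic for the four example strings. The helper name `expectedIndex` is made up for illustration and is not a solution to the question.

```javascript
// Hypothetical helper: length of the markup before the text,
// plus the position of the needle inside the text itself.
function expectedIndex(markupBeforeText, text, needle) {
    return markupBeforeText.length + text.indexOf(needle);
}

var first = expectedIndex('', 'This is my unspanned string', 'span');
var second = expectedIndex(
    '<span class="someclass">',
    'This is my spanned string', 'span');
var third = expectedIndex(
    '<span class="no-text"></span><span class="some-class"><span class="other-class">',
    'This is my spanned string', 'span');
var fourth = expectedIndex(
    '<span class="no-text"><span class="silly-example"></span></span><span class="some-class">',
    'This is my spanned string', 'span');

console.log(first, second, third, fourth); // 13 35 91 100
```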

I don't want to remove any of the HTML tags, I just don't want them included in the search.

I'm aware that attempting to use regex is almost certainly a bad idea, probably even for strings as simple as the ones my code will encounter, so please refrain from suggesting it.

I'm guessing I will need to use an HTML parser (something I've never done before). Is there one with which I can access the original parsed strings (or at least their lengths) for each node?

Might there be a simpler solution than that?

I did search around and wasn't able to find anyone asking this particular question before, so if someone knows of something I missed, I apologize for my faulty search skills.

This feels like you might be facing an [XY issue](http://www.perlmonks.org/index.pl?node_id=542341) — 1252748, Nov 03 '15 at 21:29
@thomas - Yes. There is indeed a larger issue, which can be boiled down to 'I'm working with a feature that was very poorly built and I don't have time to do the bottom-up rebuild it requires.' That said, I am looking at other ways of solving the larger problem I have, but being able to do what I asked may potentially be the most direct. — Wilson F, Nov 03 '15 at 21:39
Are you using node.js, or will the Javascript run inside an actual browser? — Mihai, Nov 03 '15 at 21:40
Thank you to everyone who attempted an answer. As @thomas surmised, it turned out that there was a totally different and unrelated (and sadly, non-obvious) solution to the larger problem I was trying to solve. I wasn't aware of said solution because of a deficiency in the documentation of our code. However, a colleague (who wasn't available when I originally asked my question) helpfully let me know. — Wilson F, Nov 03 '15 at 22:14
My question now is: even though I don't need the answer any more, should I leave the question here, if only as a warning to fellow travellers? — Wilson F, Nov 03 '15 at 22:14

score 0 · Answer 1 · answered Nov 03 '15 at 21:23

The search could loop through the string character by character. When it enters a tag, it skips to the end of that tag, so only the text outside tags is searched. If a match has started and is then interrupted by a tag, remember the partial match and continue it in the text after the tag.
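
A minimal sketch of that idea, under stated assumptions. Instead of tracking a partial match by hand, it collects the characters outside tags together with their original positions, searches the collected text, and maps the hit back to the original string. The function name `indexOfInText` is hypothetical. The sketch assumes every `<` starts a tag, treats quoted attribute values as opaque, and makes no attempt to handle comments, `<script>` contents, or character entities.

```javascript
// Hypothetical helper: index in `html` where the first match of `needle`
// within the text outside tags begins, or -1 if there is none.
function indexOfInText(html, needle) {
    var text = '';      // characters that lie outside tags
    var offsets = [];   // offsets[i] is the index in `html` of text[i]
    var inTag = false;
    var quote = null;   // quote character of the attribute value we are in, if any

    for (var i = 0; i < html.length; i++) {
        var ch = html[i];
        if (inTag) {
            if (quote) {
                // Inside a quoted attribute value: only the matching quote ends it.
                if (ch === quote) quote = null;
            } else if (ch === '"' || ch === "'") {
                quote = ch;
            } else if (ch === '>') {
                inTag = false;
            }
        } else if (ch === '<') {
            inTag = true; // assumption: a '<' always opens a tag
        } else {
            text += ch;
            offsets.push(i);
        }
    }

    var pos = text.indexOf(needle);
    // The second check covers an empty needle on input that has no text at all.
    if (pos === -1 || pos >= offsets.length) return -1;
    return offsets[pos];
}

console.log(indexOfInText('<span class="someclass">This is my spanned string</span>', 'span')); // 35
```

Because the search runs over the collected text, a match interrupted by a tag is still found; the returned index is where that match starts in the original string.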

answered Nov 03 '15 at 21:23

Martin Staufcik

That would be a parser, but the rules are a little more complex than that: it also has to ignore attributes and values within tags, and it has to correctly identify when it's in a tag. – RobG Nov 03 '15 at 22:38

Mihai · Answer 2 · 2015-11-03T22:50:33.837

You could use the browser's own HTML parser and XPath engine to search only inside the text nodes and do whatever processing you need.

Here's a partial solution:

var haystack = '  <span class="no-text"></span><span class="some-class"><span class="other-class">This is my spanned string</span></span>';
var needle = 'span';

var elt = document.createElement('elt');
elt.innerHTML = haystack;

var iter = document.evaluate('.//text()[contains(., "' + needle + '")]', elt).iterateNext();

if (iter) {
    var position = iter.textContent.indexOf(needle);
    var range = document.createRange();
    range.setStart(iter, position);
    range.setEnd(iter, position + needle.length);
    // At this point, range points at the first occurrence of `needle`
    // in the parsed copy of `haystack`. You can now delete it, replace it
    // with something else, and so on, and after that, read the updated
    // markup back from `elt.innerHTML`.
    console.log(range);
}

JSFiddle.

Regular expressions can't be reliably used to parse HTML as HTML isn't a regular language (per the wonderful [*reference in the OP*](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)). e.g. attribute values within tags can look just like tags, but they aren't. — RobG, Nov 03 '15 at 22:40
No one argues that you can use regexes to *parse* HTML and get a DOM tree in the end. However, for some limited real-world usecases (like differentiating tags and text content), regular expressions work just fine. Even so, I've edited my answer to prevent further misunderstandings. — Mihai, Nov 03 '15 at 22:57

score 0 · Answer 3 · answered Nov 03 '15 at 21:44

Here is a little function I came up with:

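function customSearch(haystack, needle) {
    var start = 0;
    var a = haystack.indexOf(needle, start); // next candidate match
    var b = haystack.indexOf('<', start);    // next tag opening

    // While a tag opens before the candidate match, skip past that tag
    // and search again from just after its closing '>'.
    while (b !== -1 && b < a) {
        var close = haystack.indexOf('>', b);
        if (close === -1) {
            return -1; // unterminated tag: avoid looping forever
        }
        start = close + 1;
        a = haystack.indexOf(needle, start);
        b = haystack.indexOf('<', start);
    }

    return a;
}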

It returns the results you expected based on your examples. Here is a JSFiddle where the results are logged in the console.

answered Nov 03 '15 at 21:44

Alvaro Flaño Larrondo

It needs another check to make sure, when inside a tag, and checking for '>', that you're not also inside a string (attribute value). See my comment on Rounin's answer. – Wilson F Nov 03 '15 at 22:09
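
To make that concern concrete, the sketch below runs a copy of the function on a made-up input whose `title` attribute contains a `>` followed by the needle. The input string is an assumption chosen for illustration and does not come from the question.

```javascript
// Copy of the function from the answer (parameter spelled `haystack`).
function customSearch(haystack, needle) {
    var start = 0;
    var a = haystack.indexOf(needle, start);
    var b = haystack.indexOf('<', start);

    while (b < a && b != -1) {
        start = haystack.indexOf('>', b) + 1;
        a = haystack.indexOf(needle, start);
        b = haystack.indexOf('<', start);
    }

    return a;
}

// Hypothetical input: the attribute value contains '>' and then the needle.
var tricky = '<span title="a > span">This is my spanned string</span>';

var found = customSearch(tricky, 'span');
var wanted = tricky.indexOf('spanned'); // where the match in the text really starts

console.log(found, wanted); // 17 34
```

The function stops skipping at the `>` inside the attribute value, so it reports the `span` that is still part of the `title` attribute rather than the one in the text.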

Rounin · Answer 4 · 2015-11-03T22:19:58.363

0

Let's start with your third example:

var desiredSubString = 'span';
var entireString = '<span class="no-text"></span><span class="some-class"><span class="other-class">This is my spanned string</span></span>';

Strip all HTML tags from entireString, above, to establish textString:

var textString = entireString.replace(/(data-([^"]+"[^"]+"))/ig,"");
textString = textString.replace(/(<([^>]+)>)/ig,"");

You can then find the index of the start of the textString within the entireString:

var indexOfTextString = entireString.indexOf(textString);

Then you can find the index of the start of the substring you're looking for within the textString:

var indexOfSubStringWithinTextString = textString.indexOf(desiredSubString);

Finally you can add indexOfTextString and indexOfSubStringWithinTextString together:

var indexOfSubString = indexOfTextString + indexOfSubStringWithinTextString;

Putting it all together:

var entireString = '<span class="no-text"></span><span class="some-class"><span class="other-class">This is my spanned string</span></span>';
var desiredSubString = 'span';

var textString = entireString.replace(/(data-([^"]+"[^"]+"))/ig,"");
textString = textString.replace(/(<([^>]+)>)/ig,"");

var indexOfTextString = entireString.indexOf(textString);
var indexOfSubStringWithinTextString = textString.indexOf(desiredSubString);
var indexOfSubString = indexOfTextString + indexOfSubStringWithinTextString;
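
The steps above can be wrapped in a function to see where they hold. This is a sketch under assumptions: the name `indexViaStrippedText` is made up, the parentheses in the first pattern are balanced so that it parses, and the second input is a hypothetical string that is not in the question. Because the approach looks up the whole stripped text inside the original string, it relies on the text being one contiguous run.

```javascript
// Hypothetical wrapper around the steps of this answer.
function indexViaStrippedText(entireString, desiredSubString) {
    var textString = entireString.replace(/(data-([^"]+"[^"]+"))/ig, '');
    textString = textString.replace(/(<([^>]+)>)/ig, '');

    var indexOfTextString = entireString.indexOf(textString);
    var indexOfSubStringWithinTextString = textString.indexOf(desiredSubString);
    return indexOfTextString + indexOfSubStringWithinTextString;
}

// Third example from the question: the text is one contiguous run.
var contiguous = '<span class="no-text"></span><span class="some-class"><span class="other-class">This is my spanned string</span></span>';
var contiguousResult = indexViaStrippedText(contiguous, 'span');

// Hypothetical input: a tag splits the text into two runs.
var split = '<b>This is</b> my spanned string';
var splitResult = indexViaStrippedText(split, 'span');
var splitWanted = split.indexOf('spanned');

console.log(contiguousResult);         // 91
console.log(splitResult, splitWanted); // 10 18
```

In the second case the stripped text never appears verbatim in the original string, so `indexOf` returns -1 and the sum no longer points at the match.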

edited Nov 03 '15 at 22:19

answered Nov 03 '15 at 21:49

Rounin

One issue I can see, with this line: `var textString = entireString.replace(/(<([^>]+)>)/ig,"");` There's no guarantee that an attribute value won't contain a '>' character, as in: ``. This is one reason why trying to parse HTML with regex has been reputed to drive otherwise sane people mad. : ) – Wilson F Nov 03 '15 at 22:03
Yes, I see. The regex syntax above is not sophisticated enough to be able to ignore a closing angle-bracket appearing in an attribute value. Perhaps something like: `var textString = entireString.replace(/(data([^"]+"[^"]+"))/ig,""); textString = textString.replace(/(<([^>]+)>)/ig,"");` ?? (See my edit above...) – Rounin Nov 03 '15 at 22:16
Does that remove anything with quotes around it before removing tags? If so, I'm afraid you now have two (further) problems: 1) no guarantee that the plain text string won't have quotation marks in it (`'This is my "spanned" string'`); 2) no guarantee that attribute values aren't using single quote characters ( `"This is my \"spanned\" string"` ). As the link in my original question says, because HTML isn't a 'regular language', regular expressions almost certainly can't handle it. – Wilson F Nov 03 '15 at 22:28
Yes, I'm beginning to see what a minefield this is. The regex above doesn't remove anything with quotes around it before removing tags, no. It removes any attribute within an HTML element which begins with `data` (or `data-` in the edited post above). However, it will need a middle line to handle data-* attributes which employ single rather than double quotes: `textString = textString.replace(/(data-([^\']+\'[^\']+\'))/ig,"");` – Rounin Nov 03 '15 at 22:36
To get just the text, use the host's own HTML parser (if it has one). Create a suitable element like a DIV, insert the markup as its innerHTML, then get the *textContent*. – RobG Nov 03 '15 at 22:43
@Rounin : I really hate to keep pouring cold water on your efforts. However, as my last reply shows, it's not just `data-` attributes that can be affected (see the `title` attribute in the `span` in that example; there are others that could conceivably have a `>` character in their strings, too, especially for pages using a framework like Angular). Also, there's no guarantee that the plain text couldn't have the character string `data` in it, followed (at some point) by a passage in quotation marks. – Wilson F Nov 03 '15 at 23:28
Yes, it's no problem - I absolutely grasp the point you're illustrating. :-) – Rounin Nov 04 '15 at 09:15
