2

The user will provide some search term. Let's just say it's a string, that may contain any words or special characters (like /, ?, ,, $, *, etc).

I need to match this sequence of characters anywhere they appear in HTML, even if the search term crosses sequential spans; in my HTML, special characters are sometimes wrapped separately.

For example: the user provides "Your mom?", and there is a paragraph which contains <span>Your mom</span><span class="special">?</span>

I need an effective way to determine that a) the query does exist, and b) which elements contain the query. The searched text can be complex HTML, and contain LOTS of words, spans, divs, etc.

Michael Gaskill
  • 7,913
  • 10
  • 38
  • 43
neaumusic
  • 10,027
  • 9
  • 55
  • 83

3 Answers3

2

I might try and write a parser that can differentiate between tags opening and closing, and their text content (hopefully the HTML does not contain incomplete tags). For indexing, perhaps you could use a stack of tuples, each representing depth and count, and memory of the current state. Your simple example would index as:

[(1,1)] tag opens, text: 'Your mom'
query text matches so far
[(1,1),(1,1)] tag closes, remove. 
[(1,2)] tag opens, maintain depth, increase count, text: '?'
query text continues to match
[(1,2)] tag closes, remove
גלעד ברקן
  • 23,602
  • 3
  • 25
  • 61
  • i think using a stack is the correct approach, i gotta figure out how to do partial matching -- "text matches so far" in your explanation – neaumusic May 19 '16 at 06:22
  • @neaumusic one way to do partial matching is simply comparing char by char and remembering the last char index. – גלעד ברקן May 19 '16 at 10:24
1

You will first have to split up your "characters" into groups. The most paranoid way to do this would be by character, but that will end up being very inefficient. Knowing what little I do about your data, I assume that anything that matches [a-zA-Z\s]+ becomes one token and everything else becomes another.

The other thing that might be logical to do is do an iterative process, where after each failed attempt, you break it down further.

No matter what you decide, you'd need to use some JavaScript to do this. But that should be fairly easy to do.

After splitting it up, you would need to begin thinking about constructing a regex.

You can put (?:<[^>]*>\s*)* between each token, but certain characters would need to be escaped before you put them in the regex. There is a complete list somewhere, but that would include: $^*.+?/\{}[]().

For your example, you might end up with something like this:

/your mom(?:<[^>]*>\s*)*\?/i

With i meaning case insensitive.

You can get the index of the match location like this:

var match = /regex/.exec("string to match against");
if (match) {
    alert("match found at " + match.index);
}
Community
  • 1
  • 1
Laurel
  • 5,965
  • 14
  • 31
  • 57
1

This solution will find and return the first element that contains the search text, even if that text contains embedded tags.

TL;DR Play with the example!

var content = $("#content");
var search = $("#search");
var go = $("#go");

function escapeRegExp(str) {
  return str.replace(/[\/\\{}()*+?.^$|[\]-]/g, "\\$&");
}

function recursiveElementSearch(regex, element) {
  var text = element.text();
  
  if (text.match(regex)) {
    var children = element.children();
    var len = children.length;
    
    for (var i=0; i < len; ++i) {
      var child = $(children[i]);
      var found = recursiveElementSearch(regex, child);
      
      if (found != null) {
        return found;
      }
    }
    
    return element;
  }
  
  return null;
}

go.click(function() {
  var value = $.map(search.val().split(""), function(value, index) {
    return escapeRegExp(value);
  });
  var regex = new RegExp(value.join(""), "i");
  var element = recursiveElementSearch(regex, content);

  console.log("Element: ", element ? element.attr("id") : "null");
});
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
<div id="content">
  <div id="first">
    <span id="a">Your mom</span><span class="special">?</span>
  </div>
  <div id="second">
    <span id="b">Where is <strong>your</strong> mom</span><span class="special">?</span>
  </div>
  <div id="third">
    <span id="c">Yours<span>&nbsp;</span><a href="#">mom</a></span><span class="special">?</span>
  </div>
  <div id="fourth">
    <span id="d">My mom</span><span class="special">!</span>
  </div>
  <div id="fifth">
    <span id="e">Their mom<i>s</i></span><span class="special">...</span>
  </div>
</div>

<input id="search" value="Your mom?">
<label for="search">Search:</label>
<button id="go">Go!</button>

The way that this works is that the input text is sanitized (escaped), and then the text of each element is recursively checked to see if the search text is contained.

The element that will be returned is the first element found at the deepest level. The search is depth-first, so a match found 3 levels deep on the first element will be returned before an element 1 level deep on the second element.

The HTML snippet provided shows that nested tags aren't an issue. Using this HTML, the search result for "Your mom?" returns div id="first", the search for "mom!" returns div id="fourth", and the search for "Yours mom" returns div id="c".

There are a few simple improvements that could be made. Here's what I'd see useful from my testing:

  • Collapse spaces in the search text to match any number of spaces (e.g. search for "Yours mom" should be the same as searching for "Yours mom")
  • Handle Unicode spaces, including "&nbsp;"
  • Include a version that returns all matches (e.g. searching for "your mom?" should return [div id="first", div id="second"], as both match)

Given all of that, this is a pretty useful way to search the text on a page.

Michael Gaskill
  • 7,913
  • 10
  • 38
  • 43