
Update: For those who marked this question as a duplicate: the text I am searching for may be contained in a single element or spread over 100 elements, and I do not know which before searching. All I know is that the words in the pattern I'm searching for came from this HTML. I need a search that skips (but remembers) the HTML/JavaScript that may be interspersed with the text I'm looking for.

I hope this explanation helps find an answer to my question.

*********** End of Update ***************

I am looking for a library or a piece of code that can search for arbitrary plain text inside an HTML document and report where it was found (start/stop offsets or tags).

Example:

  • pattern to look for: "text that I'm looking for"
  • html document:
<html>...<p>text that <b>I'm</b> <span>looking
   for<div>...</div>...</p>
  • resulting match:

text that <b>I'm</b> <span>looking for

Does anyone know of such a utility? Thanks.

Jeff Saremi
  • Possible duplicate of [Selecting text in an element (akin to highlighting with your mouse)](https://stackoverflow.com/questions/985272/selecting-text-in-an-element-akin-to-highlighting-with-your-mouse) – Randy Casburn May 23 '18 at 15:50
  • In jQuery you can target elements that contain specific text with: jQuery( ":contains('How')" ).css("text-decoration", "underline"). This underlines all of the text in every element that contains the word 'How'. Not sure if this helps, since it underlines the whole element. Maybe you could wrap your words individually in a span first. – developer May 23 '18 at 15:56
  • Possible duplicate of [How to highlight text using javascript](https://stackoverflow.com/questions/8644428/how-to-highlight-text-using-javascript) – Lewis May 23 '18 at 16:05
  • I have used this jQuery plugin in a project, fitted well https://github.com/knownasilya/jquery-highlight – Christian Benseler May 23 '18 at 17:33
  • @developer: I do not know the elements where my text could be included in. Unless I repeatedly search all the elements in the html for the text and try to stitch the results to see if all my text was found. I find this approach completely infeasible – Jeff Saremi May 23 '18 at 17:33
  • jQuery( ':contains("text that I\'m looking for")' ).css("text-decoration", "underline") will find all elements that contain "text that I'm looking for" and underline their entire text. jQuery( ":contains(Text)" ).css("text-decoration", "underline") will find all elements that contain "Text" and underline their text. Sorry if this doesn't help. – developer May 23 '18 at 17:36
  • @ChristianBenseler: Let's forget about highlighting. First I need to find the boundary where my text lies in and it's not going to be in one single elements' text – Jeff Saremi May 23 '18 at 17:36
  • @developer: looked at contains() again. This is definitely not what i'm looking for. Sorry I have a really hard time explaining what I want i guess. In person it would just take a minute but in writing it's hard – Jeff Saremi May 23 '18 at 17:43
  • @developer: this additional explanation may help: Look at my example. The text that I'm looking for does not exist in one contiguous instance at all. There is no html element that would contain the entire text that I'm looking for. There are several html elements -- lined up one after the other -- which have bits and pieces of the text I'm looking for. To the user, the text is probably shown contiguously with some formatting. This search method/engine must be smart enough to see formatting but skip over it – Jeff Saremi May 23 '18 at 17:45
  • I see your problem. I'll have a think, and see if i can suggest something – developer May 23 '18 at 17:48

1 Answer


EDITED: Did some actual programming. This algorithm allows HTML tags between the characters of a word, and any mixture of HTML tags and whitespace between words.

const haystack = '<html>This, <b>that</b>, and\nthe<i>other</i>.</html>';
const needle = 'This, that, and the other.';

// Escape regex metacharacters so needle characters such as "." match literally
const escapeChar = (ch) => ch.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Make a regex from the needle...
let regex = '';

// ...split the needle into words...
const words = needle.trim().split(/\s+/);
for (let i = 0; i < words.length; i++) {
  const word = words[i];

  // ...allow HTML tags after each character except the last one in a word...
  // ([^>]+ keeps a single tag match from running past its own closing ">")
  for (let j = 0; j < word.length - 1; j++) {
    regex += escapeChar(word.charAt(j)) + '(?:<[^>]+>)*';
  }
  regex += escapeChar(word.charAt(word.length - 1));

  // ...allow a mixture of whitespace and HTML tags after each word except the last one
  if (i < words.length - 1) regex += '(?:\\s|<[^>]+>)+';
}

// Find the first match, if any
const matches = haystack.match(new RegExp(regex));
console.log(matches);

// Report results
if (matches) {
  const match = matches[0];
  const offset = matches.index;

  console.log('Found match!');
  console.log('Offset: ' + offset);
  console.log('Length: ' + match.length);
  console.log(match);
}
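
An alternative to building one large regex, closer to the "skip but remember" wording in the question, is to strip the tags while recording where each visible character came from, search the plain text, and then map the hit back to offsets in the markup. The following is only a minimal sketch under stated assumptions: buildTextIndex and findInHtml are hypothetical names, tags are detected with a naive "<" ... ">" scan (so a ">" inside an attribute value, comments, entities, and the contents of script/style elements are not handled), and tags are treated as zero-width, so unlike the regex above a tag alone does not count as a word separator.

```javascript
// Hypothetical helper: drops tags from `html` and records, for every character
// of the remaining text, its index in the original markup.
// Assumption: a tag is anything between "<" and the next ">".
function buildTextIndex(html) {
  let text = '';
  const offsets = []; // offsets[k] is the index in html of text[k]
  let inTag = false;
  for (let i = 0; i < html.length; i++) {
    const ch = html[i];
    if (inTag) {
      if (ch === '>') inTag = false;
    } else if (ch === '<') {
      inTag = true;
    } else if (/\s/.test(ch)) {
      // Collapse each whitespace run (even one split by tags) into one space
      if (text.length > 0 && text[text.length - 1] !== ' ') {
        text += ' ';
        offsets.push(i);
      }
    } else {
      text += ch;
      offsets.push(i);
    }
  }
  return { text, offsets };
}

// Hypothetical helper: returns start (inclusive) and end (exclusive) offsets
// of the first occurrence of `needle` in `html`, or null if there is none.
function findInHtml(html, needle) {
  const { text, offsets } = buildTextIndex(html);
  const normalized = needle.trim().split(/\s+/).join(' ');
  if (normalized.length === 0) return null;
  const at = text.indexOf(normalized);
  if (at === -1) return null;
  const start = offsets[at];
  const end = offsets[at + normalized.length - 1] + 1;
  return { start, end, match: html.slice(start, end) };
}

// Usage with markup modelled on the example in the question
const sampleHtml =
  '<html><p>text that <b>I\'m</b> <span>looking\n   for<div>...</div></span></p></html>';
const found = findInHtml(sampleHtml, "text that I'm looking for");
console.log(found);
```

Because the search runs over plain text, the needle needs no regex escaping, and the offsets array can be reused for several searches over the same document. For real pages, a proper HTML parser would be a safer way to build the text and offsets than the character scan used here.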
dsharhon