How can I index a string using its pre-sanitized indices?

Question

I have a string definition, in which HTML can appear, and an array of words. I am trying to search for these words in the definition and return the start and the end positions. For example, I might want to find "Hello" in:

definition = "<strong>Hel</strong>lo World!"

Getting rid of the HTML can be done using sanitize from ActionView and HTMLEntities, but that changes the index of "Hello" in the string, so:

sanitized_definition.index("Hello")

will return 0. I need the start point to be 8, and the end point 21. I thought about mapping the entire string to my own indices like

{"1" => '<', "2" => 's', "3" => 't', .. , "9" => 'H' ...}

so that 1 maps to the first character, 2 to the second, and so on, but I'm not sure what that accomplishes, and it seems overly complicated. Does anyone have any ideas how to accomplish this?

EDIT:

Good point in the comments that it doesn't make sense that I want to include the </strong>, but not the <strong> at the beginning, partially because I haven't figured out what to do with that edge case. For the purposes of this question, a better example might be something like

definition = "Probati<strong>onary Peri</strong>od."
search_text = 'Probationary Period'

Also, after thinking about it a little bit more, I think in my particular case, the only HTML entity that I need to worry about is `&nbsp;`.
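One way to make the index-mapping idea above concrete is to strip the tags yourself and record, for each surviving character, where it sat in the original string. The following is only a sketch under simplifying assumptions: every `<` opens a tag that runs to the next `>`, entities are left untouched, and the helper names `strip_tags_with_map` and `original_range` are made up for illustration (they are not part of ActionView or HTMLEntities).

```ruby
# Hypothetical helper: removes <...> tags and returns the stripped text
# together with an array mapping each kept character to its index in html.
# Assumption: every "<" starts a tag that ends at the next ">".
def strip_tags_with_map(html)
  text = "".dup          # mutable empty string
  map = []               # map[i] = index in html of text[i]
  in_tag = false
  html.each_char.with_index do |ch, i|
    if ch == "<"
      in_tag = true
    elsif ch == ">" && in_tag
      in_tag = false
    elsif !in_tag
      text << ch
      map << i
    end
  end
  [text, map]
end

# Hypothetical helper: inclusive [start, end] of word in the original html,
# or nil when the stripped text does not contain word.
def original_range(html, word)
  text, map = strip_tags_with_map(html)
  start = text.index(word)
  return nil unless start
  [map[start], map[start + word.size - 1]]
end

original_range("<strong>Hel</strong>lo World!", "Hello")
  #=> [8, 21]
original_range("Probati<strong>onary Peri</strong>od.", "Probationary Period")
  #=> [0, 35]
```

Because the search runs on the stripped text, an ordinary `String#index` is enough, and the array translates the hit back to positions in the original markup.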

Can you post an actual example of `definition`? It doesn't have to be overly long but it's easier for people to test their answers against something real rather than guess at what you are working with. — Beartech, Aug 03 '15 at 21:04
What is the higher-level problem you're trying to solve? Maybe the solution you are trying to implement here is not the only/best one. — Michał Szajbe, Aug 03 '15 at 21:08
@MichałSzajbe Trying to give functionality similar to wikipedia style links between articles, where users can use markup to indicate a link by adding brackets. But we also have the ability to automatically add the markup if a matching name is detected. — Mangesh Tamhankar, Aug 03 '15 at 21:40
Question is not clear. What is the logic that makes you not include `<strong>` as part of the match but `</strong>` as part of the match? — sawa, Aug 03 '15 at 22:47
It is not clear what your indices are pointing to in your attempted hash. If `"1"` goes to `'<'`, it appears to me that `"8"` should go to `'>'`, but you have `"8"` going to `'H'`. — sawa, Aug 03 '15 at 22:49

Cary Swoveland · Accepted Answer · 2015-08-03T21:34:11.167

I confess I don't know much about HTML. I've assumed that adjacent letters of the target word (here 'Hello') are separated by zero or more strings bracketed by < and >, and by nothing else (though I don't know whether that assumption is correct).

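def doit(str, word)
  # Allow zero or more <...> tags between each pair of adjacent letters.
  r = Regexp.new(word.chars.join('(?:<.*?>)*'))
  ndx = str.index(r)
  # Return inclusive [start, end] indices into str, or nil if there is no match.
  ndx ? [ndx, ndx+str[r].size-1] : nil
end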

doit "<strong>Hel</strong>lo World!", "Hello" 
  #=> [8,21]

Here's what's happening:

str  = "<strong>Hel</strong>lo World!"
word = "Hello"

a = word.chars
  #=> ["H", "e", "l", "l", "o"] 
s = a.join('(?:<.*?>)*')
  #=> "H(?:<.*?>)*e(?:<.*?>)*l(?:<.*?>)*l(?:<.*?>)*o" 
r = Regexp.new(s)
  #=> /H(?:<.*?>)*e(?:<.*?>)*l(?:<.*?>)*l(?:<.*?>)*o/ 
ndx = str.index(r)
  #=> 8 
t = str[r]
  #=> "Hel</strong>lo" 
o = t.size-1
  #=> 13 
ndx ? [ndx, ndx+str[r].size-1] : nil
  #=> 8 ? [8, 8 + t.size-1] : nil
  #=> [8, 8 + 14 -1] 
  #=> [8, 21]
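A possible hardening of the same idea, offered as a sketch rather than a tested solution. It escapes each letter so that regexp metacharacters in the search word are matched literally, reads both offsets from a single `MatchData` instead of matching twice, and swaps `<.*?>` for `<[^>]*>` so a tag may span a line break. The name `find_span`, the constant `TAG_GAP` and the `ignore_case` keyword are my own illustrative choices.

```ruby
# Hypothetical variant of doit. TAG_GAP allows zero or more <...> tags
# between adjacent letters; [^>]* (unlike .) also matches newlines.
TAG_GAP = '(?:<[^>]*>)*'

def find_span(str, word, ignore_case: false)
  # Escape every character so "." or "?" in word is matched literally.
  pattern = word.chars.map { |c| Regexp.escape(c) }.join(TAG_GAP)
  m = Regexp.new(pattern, ignore_case ? Regexp::IGNORECASE : 0).match(str)
  # MatchData#end is exclusive, so subtract 1 for an inclusive end index.
  m && [m.begin(0), m.end(0) - 1]
end

find_span("<strong>Hel</strong>lo World!", "Hello")
  #=> [8, 21]
find_span("<strong>Hel</strong>lo World!", "world")
  #=> nil
find_span("<strong>Hel</strong>lo World!", "world", ignore_case: true)
  #=> [23, 27]
```

The search is case sensitive unless `ignore_case: true` is passed, and HTML entities are still not handled.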

edited Aug 03 '15 at 21:34

answered Aug 03 '15 at 21:19

This is clever, and I'm tempted to accept it because I think I can use the general concept to do what I'm trying to do, but there are a few issues. [html entities](http://www.w3schools.com/html/html_entities.asp) are not taken into account here, but I should be able to modify the regexp accordingly. – Mangesh Tamhankar Aug 03 '15 at 21:36
That was going to be my question; since Cary's answer lets you scan the unsanitized string directly does that get rid of the need to scan the sanitized version? Certainly a more elegant solution than I was thinking of. – Beartech Aug 03 '15 at 21:42
Perhaps someone familiar with HTML (that's probably everyone but me) could post a solution that uses this idea but implements it properly. – Cary Swoveland Aug 03 '15 at 21:43
I am in awe of your regex power. The only thing I can see other than what the OP said about entities, is that it needs to find the string in any case, regardless of the HTML tags. So if they were searching for "world" in the above case it would fail. Basically doing it your way you need a regex that makes the search blind to anything contained in `<..>` . Then the only edge cases would be entities and any escaped `<..>` that are part of the text of a page and not the HTML. – Beartech Aug 03 '15 at 21:52
@Beartech I don't think it fails to find "world"; the key is zero or more bracketed segments. So no HTML tags is okay. – Mangesh Tamhankar Aug 03 '15 at 21:57
Ah! good catch. I tested it in `pry` but searched for "world" rather than "World". Note to OP it will be case sensitive unless you tell the Regex to be case insensitive. – Beartech Aug 03 '15 at 21:59
@CarySwoveland Perhaps you can help with the regexp here. Html entities start with '&' and end with a ';'. Between the two you can have characters, or a # followed by numbers. So something like "&[a-z0-9#]+;"? – Mangesh Tamhankar Aug 03 '15 at 21:59
I'll look at that this evening, but if I understand you correctly you could play around with `a.join('(?:(?:<.*?>)|&(?:[a-z]+|#[0-9]+);)*')`. (`#` may need to be escaped.) I assume you meant, "Between the two you can have *only lowercase letters*, or a # followed by one or more *digits*." – Cary Swoveland Aug 03 '15 at 22:32
Watch out, this is getting dangerously close to [parsing HTML with regular expressions](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – tompave Aug 03 '15 at 22:51
@CarySwoveland That does look okay, but it gets a little complicated since the entities actually translate to characters. Best approach might be to use something like HTMLEntities gem to decode these to characters first, then use your original answer. – Mangesh Tamhankar Aug 04 '15 at 01:10
I'll take your word for it. – Cary Swoveland Aug 04 '15 at 01:15
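For anyone who wants to try the entity-aware separator floated in the comments above, here is a minimal sketch. It only skips over entities that sit between letters of the search word; it does not translate an entity into the character it stands for, which is the complication raised in the comments. The names `ENTITY_GAP` and `doit_skipping_entities` are invented for this example.

```ruby
# Hypothetical separator: zero or more tags or entities between letters.
# An entity is taken to be "&", then lowercase letters or "#" plus digits,
# then ";". In a single-quoted string "#" needs no escaping.
ENTITY_GAP = '(?:<.*?>|&(?:[a-z]+|#[0-9]+);)*'

def doit_skipping_entities(str, word)
  m = Regexp.new(word.chars.join(ENTITY_GAP)).match(str)
  # Inclusive [start, end] indices into str, or nil if there is no match.
  m && [m.begin(0), m.end(0) - 1]
end

doit_skipping_entities("<strong>Hel</strong>&shy;lo World!", "Hello")
  #=> [8, 26]
doit_skipping_entities("Hel&#173;lo", "Hello")
  #=> [0, 10]
# The entity is skipped, not decoded, so it cannot stand in for the space:
doit_skipping_entities("Probationary&nbsp;Period", "Probationary Period")
  #=> nil
```

The last call shows why decoding entities first was suggested: an entity that replaces a character of the search text is never matched by this pattern.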
