4

My experience with regex is only a little beyond introductory, so this is a challenge. Perhaps someone with a math or physics background can figure it out...

We have to wrap certain words/phrases with a <span class="tooltip"></span> so that a relevant tooltip is displayed for the contents of the span. The challenge is how to avoid wrapping a word a second time when it is part of another phrase that has already been wrapped.

The example: "Use Twitter Analyzer for analytics".

Both Twitter and Twitter Analyzer have tooltips, but in the example above only Twitter Analyzer needs to be wrapped. This is achieved by searching for the longest phrases first.

How do you prevent (using only Regular Expressions) the shorter phrase of the two from being wrapped again if it is already wrapped in another span?

Furthermore, Twitter and Twitter Analyzer are only two examples from an entire list, so the solution needs to be generic.

Any ideas?

lordg
  • You can't do it with regex alone; you should use a server-side search script that finds your phrases from a given list. – Alex Rashkov Nov 10 '10 at 18:25
  • 1
    You can't match balanced 'things' (such as '...') with regular expressions. You would need to do that to see whether words are already inside a span. Use a different approach. – The Archetypal Paul Nov 10 '10 at 18:28
  • Paul, I'm not sure about not matching balanced things, as you can use regular expressions to extract tags from a string. – lordg Nov 10 '10 at 19:03
  • Thanks infinity. You identified an error in the question. I meant regular expressions with PHP. – lordg Nov 10 '10 at 19:04
  • 1
    I think they're confused by the code snippet in your question. Answerers please keep in mind that he is looking for regular language phrases which will be replaced by the phrase wrapped with the code snippet. Though I agree that this isn't a job for regular expressions. And this should **not** be done on every page load either. This should be done when the content changes and cached server-side. – sholsinger Nov 10 '10 at 19:24
  • sholsinger, thanks for providing further clarity. That's pretty spot on. Furthermore, I definitely agree with you about not doing this on every page load. In this project, it is actually "compiling" resources which will be saved to disk and used later, so they will rarely change. – lordg Nov 10 '10 at 19:47

5 Answers

2

And now for the obligatory "you can't parse HTML with regex" link: RegEx match open tags except XHTML self-contained tags

Community
slebetman
2

I think your best bet is to match individual phrases you are looking for, and for each hit, save the string offset for the beginning of the match. Once you have built your list of offsets, sort the offsets from lowest to highest. For each offset in the list, compute the end offset of the string by adding the string length. If any of the later items in the list have an offset less than this new offset, remove them. If two offsets in the list are the same, take the longer of the two strings and throw the other out.

In your given example, the offset would be 4 for "Twitter Analyzer" and 4 for "Twitter". For the sake of demonstration, say you were also interested in "Analyzer", which has an offset of 12. The sorted list would be:

offset 4 - Twitter Analyzer - length 16
offset 4 - Twitter          - length 7
offset 12 - Analyzer        - length 8

Since there are two 4's, throw out the one with the shorter length. Then add the length of "Twitter Analyzer" to its offset to get 20. Any remaining entries whose offsets are less than 20 get thrown out, because they start inside the phrase that was kept.

To insert the tags, retain your list of start and end offsets and start at the end of the list. At each end offset insert a "</span>" and at each begin offset insert a "<span class="tooltip">". Move backward through the string until you reach the front. This lets you make the substitutions without recalculating offsets.
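The steps above could be sketched in PHP roughly as follows. This is only a minimal sketch, not the code the asker ended up using: the function name wrapPhrases is made up for illustration, and it assumes plain, case-sensitive substring matching with byte offsets (strpos/substr), with the text and the phrases in the same encoding.

```php
<?php
// Hypothetical helper (name chosen for illustration): wraps each phrase once,
// preferring the longer phrase when two candidates overlap.
function wrapPhrases(string $text, array $phrases): string
{
    // 1. Record [offset, length] for every occurrence of every phrase.
    $hits = [];
    foreach ($phrases as $phrase) {
        $len = strlen($phrase);
        if ($len === 0) {
            continue; // ignore empty phrases
        }
        $pos = 0;
        while (($pos = strpos($text, $phrase, $pos)) !== false) {
            $hits[] = [$pos, $len];
            $pos += 1; // keep scanning just past the start of this hit
        }
    }

    // 2. Sort by offset (lowest first); on equal offsets, longest first.
    usort($hits, function (array $a, array $b): int {
        return ($a[0] <=> $b[0]) ?: ($b[1] <=> $a[1]);
    });

    // 3. Drop every hit that starts before the end of the last kept hit.
    $kept = [];
    $end = 0;
    foreach ($hits as [$offset, $len]) {
        if ($offset >= $end) {
            $kept[] = [$offset, $len];
            $end = $offset + $len;
        }
    }

    // 4. Insert the tags from back to front so earlier offsets stay valid.
    foreach (array_reverse($kept) as [$offset, $len]) {
        $wrapped = '<span class="tooltip">' . substr($text, $offset, $len) . '</span>';
        $text = substr_replace($text, $wrapped, $offset, $len);
    }
    return $text;
}

echo wrapPhrases(
    'Use Twitter Analyzer for analytics',
    ['Twitter', 'Twitter Analyzer', 'Analyzer']
);
// prints: Use <span class="tooltip">Twitter Analyzer</span> for analytics
```

Because the tags are only inserted after all offsets have been collected, the search never sees its own markup, which is what makes the "already wrapped" check unnecessary here.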

Wade Tandy
  • Wade, good suggestion. In the end, the solution I used was a string based one. Unfortunately I had to deal with multi byte characters too, and it just got too crazy to do this with regex. – lordg Nov 18 '10 at 16:08
  • This is a good solution to a difficult problem, and does it efficiently. +1 - you actually solved the real programming challenge here. – Iiridayn Nov 19 '10 at 04:43
1

You cannot do this using only regex. Regular expressions cannot match an arbitrary number of balanced opening and closing tags (because such strings do not form a regular language). You will need to perform the count yourself.

Oliver Charlesworth
  • The problem with this common wisdom is that it ignores the fact that many regex implementations do far more than "regular" expressions, including recursion. PCRE (which PHP uses) does recursion, and therefore supports balanced tags. It's still ill-advised to use regex to parse *arbitrary* HTML, but where the developer knows the input format, it can be done right... with care. – eyelidlessness Nov 11 '10 at 00:43
  • @eyelidlessness: Indeed, but these extensions aren't regular expressions any more! – Oliver Charlesworth Nov 11 '10 at 08:26
  • @Oli, implementations that do far more than regular expressions aren't regular expressions anymore? Who knew! – eyelidlessness Nov 12 '10 at 19:17
  • @eyelidlessness: Well, by definition, no! If it doesn't describe a regular language... – Oliver Charlesworth Nov 12 '10 at 21:05
  • @Oli, that... was the joke. ;) – eyelidlessness Nov 13 '10 at 00:49
1

If you can store the list to be matched in regex form, you could use negative lookahead to ensure each match is distinct. You would need access to the PCRE functions. An example:

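$input = 'Use Twitter Analyzer for analytics';
// The lookahead stops "Twitter" from matching where " Analyzer" follows it.
$match = array('/Twitter(?! Analyzer)/', '/Twitter Analyzer/');
$replace = '<span class="tooltip">\0</span>'; // \0 is the whole matched text
$output = preg_replace($match, $replace, $input);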

I probably don't need to mention that this will make maintaining your match list more difficult.
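Some of that maintenance could perhaps be automated by generating the lookaheads from a plain phrase list. The following is a rough sketch under stated assumptions: buildLookaheadPatterns is a hypothetical helper, phrases are non-empty, and it only covers the case where the shorter phrase is a prefix of the longer one, as with "Twitter" and "Twitter Analyzer".

```php
<?php
// Hypothetical helper: builds one pattern per phrase and appends a negative
// lookahead for every longer phrase in the list that starts with it.
function buildLookaheadPatterns(array $phrases): array
{
    $patterns = [];
    foreach ($phrases as $phrase) {
        $lookaheads = '';
        foreach ($phrases as $other) {
            // Assumption: only the "proper prefix" relationship is handled.
            if ($other !== $phrase && strpos($other, $phrase) === 0) {
                $rest = substr($other, strlen($phrase));
                $lookaheads .= '(?!' . preg_quote($rest, '/') . ')';
            }
        }
        $patterns[] = '/' . preg_quote($phrase, '/') . $lookaheads . '/';
    }
    return $patterns;
}

$patterns = buildLookaheadPatterns(['Twitter', 'Twitter Analyzer']);
// $patterns is now ['/Twitter(?! Analyzer)/', '/Twitter Analyzer/']

$output = preg_replace(
    $patterns,
    '<span class="tooltip">\0</span>',
    'Use Twitter Analyzer for analytics'
);
```

A phrase that sits at the end or in the middle of a longer one (say "Analyzer" inside "Twitter Analyzer") is not protected by this sketch and would need a lookbehind or a different approach.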

Iiridayn
  • This is probably the most sensible approach, and could likely be automated by searching within the list for items which contain other items. – eyelidlessness Nov 11 '10 at 00:44
  • One problem with this approach is where two items will overlap (eg. "Foo Bar" and "Bar Baz"), but this will be a problem with any approach. – eyelidlessness Nov 11 '10 at 00:46
  • Given the objectives of this question, I find this answer to be a great solution. Some extra maintenance, but still workable. – lordg Nov 18 '10 at 16:07
  • Upon further reading of the link you posted, Michaelc, what about using negative lookbehind? If you search for '/(?<!<span class="tooltip">)Twitter/', you should only get Twitter where it does not have a start tag just before it. With that said, eyelidlessness is correct about the Foo Bar overlap. – lordg Nov 18 '10 at 16:20
0

Michaelc gave a good suggestion to use negative lookahead. What about negative lookbehind?

You should then get away with:

$match = '/(?<!\<span class="tooltip">)Twitter/';
$replace = '<span class="tooltip">\0</span>';
$output = preg_replace($match, $replace, $input);

We wouldn't need to maintain the match list by hand and could build each match item as we go through the word/phrase list. The downside is, as eyelidlessness said, that you will have a problem with overlaps like "Foo Bar" and "Bar Baz". You could, however, inspect the matches found to check that they don't contain a <span class="tooltip"> or a </span>. It is not 100% accurate, though.
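Building the patterns while walking the phrase list might look something like this. It is only a sketch: wrapWithLookarounds is a made-up name, and the extra negative lookahead for "</span>" is my own assumption, added so that a phrase at the end of an already wrapped phrase is skipped as well.

```php
<?php
// Hypothetical helper: wraps phrases longest-first and skips any match that
// directly follows the opening tag or directly precedes the closing tag.
function wrapWithLookarounds(string $text, array $phrases): string
{
    // Longest phrases first, so "Twitter Analyzer" is wrapped before "Twitter".
    usort($phrases, function (string $a, string $b): int {
        return strlen($b) <=> strlen($a);
    });

    foreach ($phrases as $phrase) {
        // The lookbehind is fixed-length, which PCRE requires.
        $pattern = '/(?<!<span class="tooltip">)'
            . preg_quote($phrase, '/')
            . '(?!<\/span>)/';
        $text = preg_replace($pattern, '<span class="tooltip">\0</span>', $text);
    }
    return $text;
}

echo wrapWithLookarounds(
    'Use Twitter Analyzer for analytics',
    ['Twitter', 'Twitter Analyzer', 'Analyzer']
);
// prints: Use <span class="tooltip">Twitter Analyzer</span> for analytics
```

This only looks at the characters immediately next to the match, so a phrase in the middle of a longer wrapped phrase would still be wrapped a second time, and a phrase such as "span" or "tooltip" would match inside the inserted markup itself.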

Comments?

Alan Moore
lordg
  • It's a good improvement, but I think Wade beat us both by solving your actual problem. As I understand it, his solution should also work fine for multibyte characters. – Iiridayn Nov 19 '10 at 04:43
  • Yes, you are correct. It was interesting to see that you could solve it (though not 100% exactly) with regex. Thanks for your suggestions and participation. – lordg Nov 19 '10 at 14:52