Regular expression to highlight a html document?

Question

I am building an Android app that has a WebView. The WebView displays an HTML document returned from a server.

Depending on a search string, I have to highlight a few parts of the HTML document. If the search string is 'hello world', then I have to mark text that matches the regex (hello)|(world*).

This is what I tried:

I get the HTML document from the server and search the text with a regex using Pattern and Matcher. I replace the matched words with markup that makes them look highlighted. This works great when there are no HTML tags, but it breaks when the document from the web server contains HTML tags and my search string matches one of those tags.

I hope I'm clear. Can anybody help?

Please post your regex and give more details about what specifically does not work when HTML tags are present. — AlexR, Dec 05 '11 at 12:14
Can't you just check whether the found string is an HTML tag, e.g. by searching for the < and > chars? — sherif, Dec 05 '11 at 12:14
This reminds me of something [about HTML and regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). :D — , Dec 05 '11 at 12:23
Have to quote this: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Gray, Dec 05 '11 at 13:51

RokL · Accepted Answer · 2011-12-05T13:30:05.783

I recommend using an HTML parser and then applying the regex only to the text nodes in the tree the parser returns. A regex that excludes the tags would be very complex, especially since tags have attributes whose names or values can cause your regex to match (not to mention JavaScript snippets).

In the absence of an HTML parser, you could try the regex `<[^>]++>([^<]++)<[^>]++>`, then take group 1 from each result and do a replace on it with hello|world as the regex.
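A minimal sketch of this fallback, under some stated assumptions. The pattern quoted above consumes the closing tag as part of each match, so a text run that directly follows that tag is skipped; the sketch therefore swaps in a lookaround variant, `(?<=>)[^<]+(?=<)`, which matches only the text between a `>` and the next `<`. The class name `TextRunHighlighter`, the method `highlight`, and the `<b>` wrapper are illustrative choices, not part of the original answer. It assumes reasonably well-formed markup with no raw `<` or `>` inside text or attribute values, and it will also touch the contents of script and style elements.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TextRunHighlighter {
    // A run of text sitting between a '>' and the next '<'.
    // Assumption: no raw '<' or '>' appears inside text or attribute values.
    private static final Pattern TEXT_RUN = Pattern.compile("(?<=>)[^<]+(?=<)");

    // Wraps every match of `words` in <b>...</b>, but only inside text runs.
    static String highlight(String html, Pattern words) {
        Matcher runs = TEXT_RUN.matcher(html);
        StringBuffer out = new StringBuffer();
        while (runs.find()) {
            // Apply the search regex to the text run only, never to a tag.
            String marked = words.matcher(runs.group()).replaceAll("<b>$0</b>");
            // quoteReplacement keeps '$' and '\' in the text from being
            // interpreted as group references.
            runs.appendReplacement(out, Matcher.quoteReplacement(marked));
        }
        runs.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Pattern words = Pattern.compile("hello|world");
        // The class attribute contains "hello" but is left untouched.
        System.out.println(highlight("<p class=\"hello\">hello <i>world</i></p>", words));
    }
}
```

Text that sits before the first tag or after the last one is not matched by this pattern, which is usually acceptable for a complete HTML document.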

edited Dec 05 '11 at 13:30

answered Dec 05 '11 at 12:17

Your regex works great, but what if I want to apply my regex only on what is inside tag? By HTML parser, do you mean the Swing parser? Isn't that too much for a phone to do? – Sudarshan Bhat Dec 06 '11 at 06:58
Easiest way is `int idx = html.indexOf(""); int idx2 = html.lastIndexOf("");` then adjust the bounds on regex matcher like this: `matcher.region(idx, idx2+7)` – RokL Dec 07 '11 at 09:13
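
The region idea from the comment above could be sketched roughly as follows. The tag names in that comment were lost, so this version takes the opening and closing markers as parameters; `<body>` and `</body>` in `main` are only an assumed example. The names `RegionHighlighter` and `highlightBetween`, the text-run pattern, and the `<b>` wrapper are likewise illustrative assumptions. `Matcher.region` limits where matches may occur, while `appendReplacement` and `appendTail` still copy the text outside the region unchanged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegionHighlighter {
    // Text between a '>' and the next '<' (assumes no raw '<' or '>' in text).
    private static final Pattern TEXT_RUN = Pattern.compile("(?<=>)[^<]+(?=<)");

    // Highlights `words` only between the first openTag and the last closeTag.
    static String highlightBetween(String html, String openTag, String closeTag, Pattern words) {
        int start = html.indexOf(openTag);
        int end = html.lastIndexOf(closeTag);
        if (start < 0 || end < start) {
            return html; // markers not found: leave the document unchanged
        }
        Matcher runs = TEXT_RUN.matcher(html);
        // Restrict matching to the span from openTag through closeTag.
        runs.region(start, end + closeTag.length());
        StringBuffer out = new StringBuffer();
        while (runs.find()) {
            String marked = words.matcher(runs.group()).replaceAll("<b>$0</b>");
            runs.appendReplacement(out, Matcher.quoteReplacement(marked));
        }
        runs.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>hello</title></head>"
                + "<body><p>hello world</p></body></html>";
        // Assumed markers; the title outside the region stays as it is.
        System.out.println(highlightBetween(html, "<body>", "</body>",
                Pattern.compile("hello|world")));
    }
}
```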

score 0 · Answer 2 · edited Dec 05 '11 at 14:24

It should look like this, but in Java ;)

split1 = split string around '<'

for each index i, element s1 in split1:
 split2 = split s1 around '>'
 if i == 0: apply regex and replace on split2[0] (text before the first tag)
 else if split2 has a second element: apply regex and replace on split2[1]
 split1[i] = join split2 using '>' as glue
end for

result = join split1 using '<' as glue

How it works: Your problem doesn't involve the content of the tags; you just want to find and replace text that is outside the tags, or between them. By splitting the text first by `<` and then by `>`, you end up with the content of the tag in split2[0] and the text outside the tag in split2[1], and you can then operate on either part as needed.

This technique can be used whenever you have to do simple operations on HTML text. But as soon as you need to identify tags and attributes, it is best to use an HTML parser.
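
The split-and-join pseudocode above might translate to Java roughly like this. It is a sketch under assumptions rather than a tested production routine: the class name `SplitHighlighter`, the method signature, and the `<b>$0</b>` replacement are illustrative, and the code assumes text content contains no raw `<`. `String.join` needs Java 8 or later; on older runtimes a `StringBuilder` loop would do the same job.

```java
import java.util.regex.Pattern;

final class SplitHighlighter {
    // Applies `words` only to text outside tags by splitting on '<' and '>'.
    static String highlight(String html, Pattern words, String replacement) {
        // Limit -1 keeps trailing empty strings so the join restores the input.
        String[] chunks = html.split("<", -1);
        for (int i = 0; i < chunks.length; i++) {
            // Limit 2: parts[0] is the tag content, parts[1] the text after it.
            String[] parts = chunks[i].split(">", 2);
            // The first chunk precedes any '<', so it is plain text.
            int textIndex = (i == 0) ? 0 : 1;
            if (textIndex < parts.length) {
                parts[textIndex] = words.matcher(parts[textIndex]).replaceAll(replacement);
            }
            chunks[i] = String.join(">", parts);
        }
        return String.join("<", chunks);
    }

    public static void main(String[] args) {
        Pattern words = Pattern.compile("hello|world");
        System.out.println(highlight("<p class=\"hello\">hello <i>world</i></p>",
                words, "<b>$0</b>"));
    }
}
```

A chunk after the first that contains no `>` (for example a stray `<` in text) is left unchanged, which keeps the sketch from rewriting something that might be a broken tag.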

score 0 · Answer 3 · answered Dec 05 '11 at 12:45

If you built the server that returns the HTML, why don't you have it return the document already highlighted?

If I understand correctly, the problem arises when the text you want to highlight has the same pattern as a tag, like <a>.

Alex