Text Processing - Detecting if you are inside an HTML tag in Java

Question

I have a program that does text processing on an HTML-formatted document, based on information from the same document with the HTML stripped out. Basically, I locate a word or phrase in the unformatted document, then find the corresponding word in the formatted document and alter its appearance using HTML tags to make it stand out (e.g. bold it or change its color).

Here is my problem. Occasionally, I want to format a word or phrase that might also appear as part of an HTML tag (for example, I might want to format the word "font", but only where it is a word in the text and not inside an HTML tag). Is there an easy way to detect whether a string in a block of text is part of an HTML tag or not?

By the way, I can't just strip out the HTML tags in the document and do my processing on the remaining text, because I need to preserve the HTML in the result. I need to add to the existing HTML, but I need to reliably distinguish between strings that are part of tags and strings that are not.

Any ideas?

Thank you,

Elliott

score 1 · Accepted Answer · answered Apr 08 '11 at 22:42

You could do one of two things:

1. Write a regular expression for what you're doing. There are plenty of prewritten ones you can find on Google.
2. Find a library to parse the document (e.g., http://htmlparser.sourceforge.net/) and only replace text.

The first is likely to be the fastest and easiest, but the second will be more reliable.
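
As a rough illustration of the first option, here is a minimal sketch that uses only `java.util.regex`. The class name `OutsideTagHighlighter` and the method `boldOutsideTags` are made up for this example. It assumes every tag is a simple `<...>` run with no `>` inside attribute values, and it knows nothing about comments or `<script>` blocks, which is exactly where a real parser is more reliable.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
final class OutsideTagHighlighter {

    /** Wraps each whole-word occurrence of word in <b>...</b>, leaving anything between '<' and '>' untouched. */
    static String boldOutsideTags(String html, String word) {
        // First alternative consumes a whole tag; second alternative (group 1) matches the word in text.
        Pattern p = Pattern.compile("<[^>]*>|\\b(" + Pattern.quote(word) + ")\\b");
        Matcher m = p.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = (m.group(1) == null)
                    ? m.group()                      // matched a tag: copy it through unchanged
                    : "<b>" + m.group(1) + "</b>";   // matched the word in text: emphasize it
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: <font color="red">Pick a <b>font</b></font>
        System.out.println(boldOutsideTags("<font color=\"red\">Pick a font</font>", "font"));
    }
}
```

Because the tag alternative comes first and consumes the whole tag, the word alternative never gets a chance to match inside one.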

score 0 · Answer 2 · answered Apr 08 '11 at 22:40

Use the following regex to detect HTML tags: "<.*?>" (the angle brackets do not need escaping, and "\<" is not a valid escape in a Java string literal).

And here you can learn how to use regex effectively in your Java code. Happy coding ;)
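
To show how a pattern of the form `<.*?>` could answer the "am I inside a tag?" question for a given character offset, here is a small sketch. `TagOffsetCheck` and `isInsideTag` are names invented for this example, and the `DOTALL` flag is my own addition so that a tag may span a line break. It shares the usual weaknesses of regex-on-HTML: a `>` inside an attribute value, comments, and script blocks will confuse it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
final class TagOffsetCheck {

    // Reluctant match from '<' to the nearest '>'; DOTALL lets '.' match newlines too.
    private static final Pattern TAG = Pattern.compile("<.*?>", Pattern.DOTALL);

    /** Returns true if the character at index lies inside a <...> match, brackets included. */
    static boolean isInsideTag(String html, int index) {
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            if (index < m.start()) {
                return false;   // matches come in order, so every later tag starts after index
            }
            if (index < m.end()) {
                return true;    // m.start() <= index < m.end()
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String html = "<font color=\"red\">Pick a font</font>";
        System.out.println(isInsideTag(html, html.indexOf("font")));       // true: the tag name
        System.out.println(isInsideTag(html, html.indexOf("Pick a") + 7)); // false: "font" in the text
    }
}
```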

Hallaghan


You have apparently never seen [the highest-upvoted answer ever](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). – Mike Daniels Apr 08 '11 at 23:02
@Mike Daniels: No, but I'm looking at it now ;) Thanks for the pointers, that should come in handy for me as well. – Hallaghan Apr 08 '11 at 23:04

score 0 · Answer 3 · edited May 23 '17 at 12:04

If you have parsed the document into a DOM, which is what you should have if you are doing this correctly, then ask for the parent tag that contains the current node, and keep walking up for as long as the parent is not the tag you are looking for.
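
The parent walk can be sketched with the JDK's built-in `org.w3c.dom` API. This is only an illustration under a strong assumption: the input here is well-formed XML (XHTML), because the JDK's XML parser rejects ordinary tag-soup HTML, for which you would need a real HTML parsing library. `AncestorCheck`, `hasAncestor`, and `parse` are names invented for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; assumes well-formed XHTML input.
final class AncestorCheck {

    /** Walks up from node and reports whether any enclosing element is named tagName. */
    static boolean hasAncestor(Node node, String tagName) {
        for (Node p = node.getParentNode(); p != null; p = p.getParentNode()) {
            if (p.getNodeType() == Node.ELEMENT_NODE && p.getNodeName().equalsIgnoreCase(tagName)) {
                return true;
            }
        }
        return false;
    }

    /** Parses a well-formed XML string into a DOM document. */
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<p>Pick a <b>font</b> you like</p>");
        Node text = doc.getElementsByTagName("b").item(0).getFirstChild(); // the text node "font"
        System.out.println(hasAncestor(text, "b")); // true
        System.out.println(hasAncestor(text, "p")); // true
        System.out.println(hasAncestor(text, "a")); // false
    }
}
```

Once the document is a tree, tag names and attributes live on element nodes while visible words live in text nodes, so editing only text nodes sidesteps the "inside a tag" problem altogether.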

If you use a custom search or a regex to parse HTML, then check the best answer to this question:

RegEx match open tags except XHTML self-contained tags (it has over 4000 upvotes for a reason)

answered Apr 08 '11 at 22:47

Margus
