Hi, I am trying to find a regex that lets me replace words in HTML. The problem occurs when the word I am trying to replace also appears inside an HTML tag.
Example: <img class="TEST">asd TEST asd dsa asd </img>
and I need to match the second "TEST" only.
The regex I am looking for should look like >[^<]*TEST, but this regex also matches the characters before the word TEST. Is it possible to select only the word TEST? Please keep other combinations in mind as well (I don't think " TEST " is a good solution, since the text could contain other characters around the word).

-
See http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Bala R Apr 21 '11 at 13:35
-
This is a job for a parser. Do a search for: "java html parser" and you will be on your way. – ridgerunner Apr 21 '11 at 15:33
3 Answers
First of all, regex is not a good option for HTML parsing. There are plenty of capable HTML parsers that you can use.
But if you insist on using regex, here is the pattern:
(?<=>.*)TEST(?=.*<)
For Java, which needs a lookbehind with a bounded maximum length:
(?<=>.{0,100000})TEST(?=.{0,100000}<)
For more information on why we cannot use * or + inside a lookbehind in Java, see: Regex look-behind without obvious maximum length in Java
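For anyone who wants to try the Java variant, here is a minimal sketch of how it could be wired up. The class and method names are made up for illustration, the bound is shrunk to 1000 to keep it fast, and Pattern.DOTALL is my own addition (an assumption, not part of the answer) so that `.` also crosses line breaks. Note that the pattern only checks that some `>` appears before the word and some `<` after it, so in a longer document it can still hit attribute values of later tags.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaroundReplace {
    // Bounded quantifiers, because Java's lookbehind needs a maximum length.
    // The bound 1000 and the DOTALL flag are assumptions made for this sketch.
    private static final Pattern WORD = Pattern.compile(
            "(?<=>.{0,1000})TEST(?=.{0,1000}<)", Pattern.DOTALL);

    // Replaces every TEST that has a '>' somewhere before it and a '<' somewhere after it.
    static String replaceWord(String html, String replacement) {
        return WORD.matcher(html).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String html = "<img class=\"TEST\">asd TEST asd dsa asd </img>";
        // prints: <img class="TEST">asd DONE asd dsa asd </img>
        System.out.println(replaceWord(html, "DONE"));
    }
}
```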

-
I am not parsing the whole HTML; for that I use Jericho. I just wanted an easy way of replacing some words. I can't get your regex to work ... testing here: http://myregexp.com/ – rhorvath Apr 21 '11 at 14:12
-
I like your solution, but it does not work for code like this: `
[newLine here] TEST [newLine here]
` – rhorvath Apr 22 '11 at 16:27
First of all, as has been said and will be said again, using regex for XML is usually a bad idea. But for really simple cases it can work, especially if you can live with sub-optimal results.
So, just put the word in a capturing group and replace only the group.
Something like:
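// needs: import java.util.regex.Matcher; import java.util.regex.Pattern;
Pattern replacePattern = Pattern.compile(">[^<]*(TEST)");
Matcher matcher = replacePattern.matcher(theString);
// start(1)/end(1) are only valid after a successful find(); this replaces one occurrence
String result = matcher.find() ? theString.substring(0, matcher.start(1)) + replacement + theString.substring(matcher.end(1)) : theString;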
Disclaimer: Not tested, might have some off-by-ones. But the concept should be clear.
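If more than one occurrence has to be replaced, a variation on the same idea might look like the sketch below. It is not the snippet above: instead of capturing the word, it matches each run of text that follows a `>` and does a plain string replace inside that run. The class and method names are hypothetical, and it assumes there is no raw `>` inside attribute values and that text before the first tag does not need handling.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TextRunReplace {
    // A '>' followed by a run of non-'<' characters: roughly the text after a tag.
    private static final Pattern TEXT_RUN = Pattern.compile(">[^<]+");

    static String replaceInText(String html, String word, String replacement) {
        Matcher m = TEXT_RUN.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Literal (non-regex) replacement, applied inside this text run only.
            String fixed = m.group().replace(word, replacement);
            // quoteReplacement keeps '$' and '\' in the text from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(fixed));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<img class=\"TEST\">asd TEST asd TEST </img>";
        // prints: <img class="TEST">asd DONE asd DONE </img>
        System.out.println(replaceInText(html, "TEST", "DONE"));
    }
}
```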

What if "TEST" is inside a different tag, say the body tag, or for that matter the html tag?
-
Ah, maybe I put it the wrong way. I mean between '<' and '>'. It is OK if the word is inside the tag <> here >, not OK if it is < here>. – rhorvath Apr 21 '11 at 15:28
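With the requirement stated that way (skip the word only when it sits between '<' and '>'), one regex technique that is sometimes used is a negative lookahead: match TEST only if the next angle bracket after it is not a closing `>`. This is a different pattern from the ones posted in the answers, and only a sketch under the assumption that attribute values and text never contain raw `<` or `>`; a parser is still the safer route. The class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OutsideTagReplace {
    // TEST, unless the next angle bracket after it is a closing '>'
    // (which would mean the word sits between '<' and '>').
    private static final Pattern WORD = Pattern.compile("TEST(?![^<>]*>)");

    static String replaceWord(String html, String replacement) {
        return WORD.matcher(html).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String html = "<img class=\"TEST\">asd TEST asd dsa asd </img>";
        // prints: <img class="TEST">asd DONE asd dsa asd </img>
        System.out.println(replaceWord(html, "DONE"));
    }
}
```

Because the lookahead uses a negated character class rather than `.`, line breaks around the word do not need any extra flag.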