I have the following scenario. I need to change

<a href="ab/xyz" onclick="ab/123"></a>

to

<a href="pq/xyz" onclick="pq/123"></a>

Basically, I need to replace "ab" with "pq" whenever "ab" appears in an attribute value of an HTML tag.

I wrote the following regex:

(<[^>]+)((=")(ab)([^>/"]*"))+([^>].*>)

and I am calling replaceAll:

if (matcher.find())
    matcher.replaceAll("$1$3pq$4$5");

The above code replaces only one attribute value per tag, even though I have a repetition operator in my regex and I am calling replaceAll.

If I change the "if" condition to a while loop, it changes all attributes, one attribute per iteration.

Is there a way to just replace all matches in all attribute values without a loop?

Solution: A dumb regex does the trick, even without the repetition operator. The problem was that I was matching the entire tag.

user1810502
  • http://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la – Cfreak Feb 20 '14 at 22:53
  • How about just using matcher.replaceAll("ab", "pq"); (could also be just Replace, haven't used either in a while) – Boyen Feb 20 '14 at 22:54
  • @Boyen That only works if you have some way to get all attributes first; otherwise you'd replace every ab in the page, not only the ones inside attribute values. As it seems the OP is running the regexp on the whole source code (surely not the best thing to do performance-wise, but there may be a reason for it), it's not that simple. – Johannes H. Feb 20 '14 at 22:56
  • @Cfreak Totally unrelated, BTW. This is neither meant to validate or parse HTML, nor are regexps as implemented in Java equal to "regular expressions" as understood by computer science (see the first comment on the question you linked to). – Johannes H. Feb 21 '14 at 11:29

1 Answer

It replaces only one occurrence because the .* at the end matches the entire length of your string (well, everything up to the last >, but most likely that is the end of the document, since it'll end with html>) - and there is no other match after that.
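To make the effect concrete, here is a minimal sketch. The pattern is a simplified stand-in for the one in the question (an assumption, not the original regex), and the class and method names are made up for illustration. It counts how many matches find() yields and where the last one ends.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyTailDemo {
    // Simplified stand-in for the question's pattern (an assumption, not the original):
    // tag start, an attribute value beginning with "ab", then a greedy .* up to the last '>'.
    static final Pattern WHOLE_TAG = Pattern.compile("(<[^>]+=\")ab(.*>)");

    // Returns {number of matches, end offset of the last match}.
    static int[] matchStats(String input) {
        Matcher m = WHOLE_TAG.matcher(input);
        int count = 0;
        int lastEnd = -1;
        while (m.find()) {
            count++;
            lastEnd = m.end();
        }
        return new int[] {count, lastEnd};
    }

    public static void main(String[] args) {
        String input = "<a href=\"ab/xyz\" onclick=\"ab/123\"></a>";
        int[] stats = matchStats(input);
        System.out.println(stats[0] + " match(es), last one ends at offset "
                + stats[1] + " of " + input.length());
    }
}
```

On the sample tag from the question there is a single match and it ends at the end of the input, so nothing is left for a second match to start in.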

Java supports lookaheads and lookbehinds, and we'll need those to make it work. Basically, a lookahead tells Java to "only match if the match is followed by whatever, but whatever is not part of the match itself". Lookbehinds are the same, except that whatever has to precede the match. Unfortunately, Java doesn't support unbounded quantifiers such as * and + inside lookbehinds, so they're a little tricky, but this should work:

([^>]*?="[^"]*?)ab(?=[^<]*>)

Replace it with $1pq.
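As a rough sketch of how this could be wired up in Java: the regex and the $1pq replacement are the ones given above, while the class name, method name, and sample input are only illustrative.

```java
import java.util.regex.Pattern;

public class AttrReplaceDemo {
    // Regex from the answer: lazily consume text up to an opening =" and into the value,
    // match "ab", and require via lookahead that a '>' follows before any '<'.
    private static final Pattern AB_IN_ATTR =
            Pattern.compile("([^>]*?=\"[^\"]*?)ab(?=[^<]*>)");

    static String replaceAb(String html) {
        // $1 re-inserts everything captured before "ab"; "pq" takes the place of "ab".
        return AB_IN_ATTR.matcher(html).replaceAll("$1pq");
    }

    public static void main(String[] args) {
        String input = "<a href=\"ab/xyz\" onclick=\"ab/123\"></a>";
        System.out.println(replaceAb(input));
    }
}
```

Because the lookahead does not consume the rest of the tag, the next search starts right after the replaced "ab", so a single replaceAll call reaches both attributes with no explicit loop.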

I tested it and it works - but it only replaces one ab inside each attribute (the first one). If you have multiple abs in one attribute and all of them should be replaced, I see no way (without proper lookbehinds).
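One possible workaround, which is not a single regex and not something tested in this answer, is to match each attribute value as a whole and do a plain string replace inside it. The sketch below assumes Java 9 or later (for Matcher.replaceAll with a function argument) and double-quoted attribute values; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AttrValueReplaceAll {
    // Matches one double-quoted attribute value (=" ... ") that sits inside a tag:
    // the lookahead requires a '>' to follow before any '<'. Group 1 is the value.
    private static final Pattern ATTR_VALUE =
            Pattern.compile("=\"([^\"]*)\"(?=[^<]*>)");

    static String replaceInAttrs(String html, String from, String to) {
        // Java 9+: the function is called once per match and returns its replacement.
        // quoteReplacement keeps '$' and '\' in the value from being read as group syntax.
        return ATTR_VALUE.matcher(html).replaceAll(m ->
                Matcher.quoteReplacement("=\"" + m.group(1).replace(from, to) + "\""));
    }

    public static void main(String[] args) {
        String input = "<a href=\"ab/ab\" onclick=\"ab/123\">ab</a>";
        System.out.println(replaceInAttrs(input, "ab", "pq"));
    }
}
```

This replaces every ab inside each attribute value and leaves text between tags alone, at the cost of a callback per attribute instead of a pure replacement string.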

Note that this is assuming valid HTML - it may yield unexpected results on invalid HTML.

Johannes H.
  • I did that but still the same, without a while loop it only changes 1 attribute per tag in the html – user1810502 Feb 21 '14 at 00:24
  • It does, for the same reason. The regexp matches the entire tag, and it only looks for other matches AFTER the first match. You'll need lookaheads and lookbehinds to make it work - but I'm not sure if Javascript supports those. If it doesn't, there is no way. (Your original code should have only replaced one tag per page, not one per tag) – Johannes H. Feb 21 '14 at 11:15
  • erm... that was meant to be Java, autocorrect did that ;) Fear not, I AM looking for the correct language ;) – Johannes H. Feb 21 '14 at 11:22