Why is this regex not giving the expected output?

Question

I have a string containing the values shown below. I want to replace the HTML img tags that contain a specific customerId with some new text. I tried a small Java program, but it does not give the expected output. Here are the details.

My input string is

 String inputText = "Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/></p>"
    + "<p>someText</p><img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456/> ..Ending here";

Regex is

  String regex = "(?s)\\<img.*?customerId=3340.*?>";

The new text I want to put into the input string:

EDIT Starts:

String newText = "<img src=\"getCustomerNew.do\">";

EDIT ENDS:

Now I am doing

  String outputText = inputText.replaceAll(regex, newText);

The output is

 Starting here.. Replacing Text ..Ending here

But my expected output is

 Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/></p><p>someText</p>Replacing Text ..Ending here

Please note that in my expected output only the img tag containing customerId=3340 is replaced with Replacing Text. I don't understand why both img tags are being replaced in the actual output.

You are parsing HTML with regex, which just never works fully (this is a limit of regex in general, not of your regex skills) — ratchet freak, Dec 13 '12 at 18:07
@ Some1.Kill.The.DJ Can you help me get the expected outcome with an HTML parser like jsoup? — M Sach, Dec 13 '12 at 18:27
M Sach you can see my answer for a complete example of jsoup working. — Vicent, Dec 14 '12 at 10:55

score 4 · Answer 1 · answered Dec 13 '12 at 18:18

You've got "wildcard"/"any" patterns (.*) in there which will extend the match to the longest possible matching string, and the last fixed text in the pattern is a > character, which therefore matches the last > character in the input text, i.e. the very last one!

You should be able to fix this by changing the .* parts to something like [^>]+ so that the matching won't span past the first > character.

Parsing HTML with regular expressions is bound to cause pain.
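A minimal Java sketch of the negated character class fix, using the input from the question. The class and method names are hypothetical, and the `\b` after the ID is an addition not in the original answer (it keeps 3340 from also matching an ID such as 33401). The ID is assumed to consist of digits only. Because `[^>]*` cannot cross a `>`, a match attempt that starts at the first img tag fails inside that tag, and the engine moves on to the second one.

```java
import java.util.regex.Matcher;

class NegatedClassDemo {
    // Replaces only the <img ...> tag whose attributes mention the given customerId.
    // Assumes customerId contains digits only, so it is safe to splice into the regex.
    static String replaceImg(String html, String customerId, String newTag) {
        // [^>]* stops at the first '>', so the match is confined to a single tag.
        String regex = "<img[^>]*customerId=" + customerId + "\\b[^>]*>";
        // quoteReplacement guards against '$' or '\' in the replacement text.
        return html.replaceAll(regex, Matcher.quoteReplacement(newTag));
    }

    public static void main(String[] args) {
        String inputText = "Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/></p>"
            + "<p>someText</p><img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456/> ..Ending here";
        System.out.println(replaceImg(inputText, "3340", "<img src=\"getCustomerNew.do\">"));
    }
}
```

With the question's input, only the second img tag is replaced and the first one is left as it was. The pattern still breaks if an attribute value itself contains a `>`, which is one more reason to prefer a real HTML parser.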

answered Dec 13 '12 at 18:18

Greg A. Woods

@Greg can i get expected output with jsoup library? – M Sach Dec 13 '12 at 18:25
`.*?` isn't any different from `.*` actually. zero or more matches of zero or more characters is, well, zero or more characters, including any number of `>` characters. – Greg A. Woods Dec 13 '12 at 18:26
I don't do Java, sorry -- I just spotted a typical RE design error. – Greg A. Woods Dec 13 '12 at 18:27
Are you suggesting something like "(?s)\\<img[^>]+?customerId=3340[^>]+?>"? It seems to work, but I am not sure whether that is the regex you meant. – M Sach Dec 13 '12 at 18:40
@GregA.Woods `.*?` is not the same as `(.*)?`. If you have a `?` after a repetition quantifier you make it ungreedy. That *does* make a difference. [Further reading](http://www.regular-expressions.info/repeat.html) – Martin Ender Dec 13 '12 at 18:42
The proof is in what was matched, not what should have matched. (There are no parens in the original RE, but on second thought that shouldn't matter.) See the section titled "An Alternative to Laziness" in the link you gave. – Greg A. Woods Dec 13 '12 at 19:09
I can't find any claim as to exactly what RE syntax Java supports, other than its own, but in case it's POSIX ERE, well, POSIX EREs don't support lazy quantifiers. I guess Java is at least close to POSIX ERE given what matched in the example above. – Greg A. Woods Dec 13 '12 at 19:20
@m.buettner, Where did he say `.*?` is the same as `(.*)?`? Was there an edit that I'm not seeing? I think his point is that the greediness of `.*` versus `.*?` is irrelevant. The problem is that the match is starting too early, and the solution is to dump `.*` altogether and use `[^>]*` instead. – Alan Moore Dec 13 '12 at 21:58
@AlanMoore, I took "`.*?` isn't any different from `.*` actually. zero or more matches of zero or more characters" as Greg saying that `?` is the optional quantifier... which it isn't in this case. The reason why ungreediness doesn't make a difference here is that ungreediness is only ungreedy with respect to the right end of the match. The left end of the match will always be greedy. – Martin Ender Dec 13 '12 at 22:02
@GregA.Woods are we on the same page that (in general) `.*?` is different from `.*`... just that the former won't do the trick in this case? I never said your answer isn't right - because it is. I just found your comment misleading ;). Plus your explanation isn't 100% accurate, because you say that `.*` will take the longest possible match, but that is not what the OP uses. The reason why he still gets the long match, although he uses `.*?`, is the actual problem here. In any case a negated character class fixes it, of course. – Martin Ender Dec 13 '12 at 22:06
@m.buettner It is clear from the example that whatever regex package @M is using does not implement lazy quantifiers, i.e. the `?` in `.*?` has zero effect and you can see that by what was replaced, i.e. what was matched (and it would likely have no effect in `(.*)?` either, unless there's a bug in the regex package and the grouping somehow turns on the lazy matching feature -- `?` is always supposed to be a lazy quantifier, IFF the implementation supports lazy quantifiers) – Greg A. Woods Dec 13 '12 at 22:57
@GregA.Woods there is no regex engine (known to me) in which a lazy quantifier would have made a difference here. all that laziness does is **end** the match as soon as possible. however laziness can *never* affect the beginning of the match. every regex engine will always return the leftmost possible beginning of a match. the first ` – Martin Ender Dec 13 '12 at 23:06
I agree -- though that seemed to be @M's initial intent. – Greg A. Woods Dec 14 '12 at 02:13

score 1 · Accepted Answer · edited May 23 '17 at 12:21

As other people have told you in the comments, HTML is not a regular language so using regex for manipulating it is usually painful. Your best option is to use an HTML parser. I haven't used Jsoup before, but googling a little bit it seems you need something like:

import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;

public class MyJsoupExample {
    public static void main(String[] args) {
        String inputText = "<html><head></head><body><p><img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123\"/></p>"
            + "<p>someText <img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456\"/></p></body></html>";
        Document doc = Jsoup.parse(inputText);
        // Select every img whose src attribute contains "customerId=3340"
        Elements myImgs = doc.select("img[src*=customerId=3340]");
        for (Element element : myImgs) {
            // Swap the whole img node for a plain text node.
            // Older jsoup releases (as in the original answer) used new TextNode(text, baseUri).
            element.replaceWith(new TextNode("my replaced text"));
        }
        System.out.println(doc.toString());
    }
}

Basically the code gets the list of img nodes with a src attribute containing a given string

Elements myImgs = doc.select("img[src*=customerId=3340]");

then loops over the list and replaces those nodes with some text.

UPDATE

If you don't want to replace the whole img node with text but instead you need to give a new value to its src attribute then you can replace the block of the for loop with:

element.attr("src", "my new value");

or if you want to change just a part of the src value then you can do:

String srcValue = element.attr("src");
element.attr("src", srcValue.replace("getCustomers.do", "getCustomerNew.do"));

which is very similar to what I posted in this thread.

edited May 23 '17 at 12:21

Community

answered Dec 13 '12 at 19:52

Vicent

Vicent, it works well, but I have one issue. When I use "<img src=\"getCustomerNew.do\"/>" instead of "my replaced text", jsoup writes the element as &lt;img src=&quot;getCustomerNew.do&quot;/&gt; instead of <img src="getCustomerNew.do"/>. – M Sach Dec 15 '12 at 09:15
It looks like it is encoding characters like < and ". How can I stop this? – M Sach Dec 15 '12 at 09:16
So you don't want replace the whole img node just the value of the src attribute? – Vicent Dec 15 '12 at 09:28
I want to replace the whole image tag, but with a new image tag. My new image is "<img src=\"getCustomerNew.do\"/>". The old image tag does get replaced with the new one, but when I call doc.toString() I see the new image tag as &lt;img src=&quot;getCustomerNew.do&quot;/&gt; instead of <img src="getCustomerNew.do"/>. – M Sach Dec 15 '12 at 09:56
AFAIK it makes no sense to replace an existing node with a node of the same tag just for changing the value of its attributes. The proper way to do it is simply to change the attributes (I've simplified my UPDATE). I suppose that you are getting those &lt; and similar entities because you keep using the new TextNode() part of the code, which can't be used for creating any kind of node (for instance `img` nodes) but only text nodes. – Vicent Dec 15 '12 at 10:09
I've just seen your question updates. It is not clear to me if you want to get rid of the old value of the src attribute or you want just to change part of it. Anyway my last update cover both possibilities. – Vicent Dec 15 '12 at 12:38
Thanks Vicent. You have answered my question. I think it's always better to use well-tested libraries like jsoup for HTML parsing instead of regex, which can behave in weird ways in some scenarios. – M Sach Dec 15 '12 at 15:01
Nice to know it. I also strongly recommend you to use Jsoup instead of regex in [your previous question](http://stackoverflow.com/questions/13857509/regex-to-replace-the-specific-string-with-in-image-tag/13882807#13882807). – Vicent Dec 15 '12 at 15:25

score 0 · Answer 3 · answered Dec 15 '12 at 15:47

What happens is that your regex starts matching at the first img tag, then consumes everything (regardless of whether the quantifier is greedy or not) until it finds customerId=3340, and then continues consuming everything until it finds >.
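To see this for yourself, the matched span can be printed with a `Matcher`. The sketch below is only an illustration (the class and helper names are made up). It runs the question's lazy pattern and a greedy variant against the question's input. Both matches begin at the first `<img`, because the engine reports the leftmost position where a match can start. Laziness only affects where the match ends.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchSpanDemo {
    // Returns the text of the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String inputText = "Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/></p>"
            + "<p>someText</p><img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456/> ..Ending here";
        // Lazy quantifiers, as in the question
        System.out.println("lazy:   " + firstMatch("(?s)<img.*?customerId=3340.*?>", inputText));
        // Greedy quantifiers, for comparison
        System.out.println("greedy: " + firstMatch("(?s)<img.*customerId=3340.*>", inputText));
    }
}
```

For this particular input the two patterns match exactly the same text, running from the first img tag through the end of the second one.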

If you want it to consume just the img with customerId=3340, think about what makes this tag different from the other tags it may match.

In this particular case, one possible solution is to look at what is behind that img tag using a look-behind operator (which doesn't consume a match). This regex will work:

String regex = "(?<=</p>)<img src=\".*?customerId=3340.*?>";
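A short sketch of how the look-behind pattern behaves, assuming the input from the question (the class and method names are hypothetical). It works there only because the first img tag is not directly preceded by `</p>`. The second call in `main` uses a made-up input where the first img also follows `</p>`, and the match again swallows both tags.

```java
import java.util.regex.Matcher;

class LookBehindDemo {
    // Applies the look-behind regex from this answer and substitutes newTag for each match.
    static String replaceAfterParagraph(String html, String newTag) {
        String regex = "(?<=</p>)<img src=\".*?customerId=3340.*?>";
        return html.replaceAll(regex, Matcher.quoteReplacement(newTag));
    }

    public static void main(String[] args) {
        String inputText = "Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/></p>"
            + "<p>someText</p><img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456/> ..Ending here";
        // Only the second img is preceded by </p>, so only it is replaced.
        System.out.println(replaceAfterParagraph(inputText, "<img src=\"getCustomerNew.do\">"));

        // Hypothetical input: both img tags follow </p>, so the match spans both of them.
        String fragile = "</p><img src=\"a?customerId=3334\"/><img src=\"b?customerId=3340\"/>";
        System.out.println(replaceAfterParagraph(fragile, "X"));
    }
}
```

The look-behind depends on the exact markup that happens to surround the tag, so it is more fragile than restricting the match to a single tag or using an HTML parser.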
