
I have an HTML string that includes lots of image tags. I need to get each tag and change it. For example:

String imageRegex = "(<img.+(src=\".+\").+/>){1}";
String str = "<img src=\"static/image/smiley/comcom/9.gif\" smilieid=\"296\" border=\"0\" alt=\"\" />hello world<img src=\"static/image/smiley/comcom/7.gif\" smilieid=\"294\" border=\"0\" alt=\"\" />";
Matcher matcher = Pattern.compile(imageRegex, Pattern.CASE_INSENSITIVE).matcher(str);
int i = 0;
while (matcher.find()) {
    i++;
    Log.i("TAG", matcher.group());
}

The result is:

<img src="static/image/smiley/comcom/9.gif" smilieid="296" border="0" alt="" />hello world<img src="static/image/smiley/comcom/7.gif" smilieid="294" border="0" alt="" />

But that's not what I want. I want the result to be:

<img src="static/image/smiley/comcom/9.gif" smilieid="296" border="0" alt="" />
<img src="static/image/smiley/comcom/7.gif" smilieid="294" border="0" alt="" /> 

What's wrong with my regular expression?

Mejonzhan
  • Can I refer you to this answer: http://stackoverflow.com/a/1732454/83109 – David M Jul 10 '12 at 13:14
  • Is there anything wrong with regexing out only tags though? – namenamename Jul 10 '12 at 13:20
  • Yes, there is. The problem is that HTML isn't a regular language, and so it's not a good candidate for analysis with a regular expression. Sometimes you can make it work in a pinch (this may be one of those cases), but it's a little like driving nails with an old shoe. It may get the job done, but it's not really the right tool. – Ian McLaird Jul 10 '12 at 13:23
  • As the comments to the question I've linked to say, there is a big difference between PARSING and MATCHING. I just like that answer. – David M Jul 10 '12 at 13:24
  • Regular expressions handle strings, and HTML is made of strings, so why can't a regular expression handle HTML? As for "HTML isn't a regular language": this has nothing to do with language, it's just strings, so why not? – Mejonzhan Jul 10 '12 at 13:36
  • To clarify for @Mejonzhan, regexes don't handle *all* strings. They handle strings that conform to certain rules. For example, regexes can't handle the idea of matching balanced, arbitrarily nested `(` and `)` characters, either. With HTML, you can often get lucky, and by chance, handle the text with a regex. But that's not a given with HTML, because it's actually a tree structure, which is defined recursively rather than sequentially, and a regex needs a sequential structure to work. – Ian McLaird Jul 10 '12 at 13:59

3 Answers


Try (<img)(.*?)(/>). The lazy .*? stops at the first /> after each <img, so this should do the trick, although yes, you shouldn't use regex for parsing HTML, as people will tell you over and over.

I don't have Eclipse installed, but I have VS2010, and this works for me in C#:

        String imageRegex = "(<img)(.*?)(/>)";
        String str = "<img src=\"static/image/smiley/comcom/9.gif\" smilieid=\"296\" border=\"0\" alt=\"\" />hello world<img src=\"static/image/smiley/comcom/7.gif\" smilieid=\"294\" border=\"0\" alt=\"\" />";
        System.Text.RegularExpressions.MatchCollection match = System.Text.RegularExpressions.Regex.Matches(str, imageRegex, System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        StringBuilder sb = new StringBuilder();
        foreach (System.Text.RegularExpressions.Match m in match)
        {
            sb.AppendLine(m.Value);
        }
        System.Windows.MessageBox.Show(sb.ToString());

Result:

<img src="static/image/smiley/comcom/9.gif" smilieid="296" border="0" alt="" /> 
<img src="static/image/smiley/comcom/7.gif" smilieid="294" border="0" alt="" />
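
Since the question is in Java, here is a rough port of the same idea to java.util.regex. It is only a sketch: it assumes every image tag is self-closing (ends in `/>`) and that no attribute value contains `/>`. The class name `ImgTagFinder` and the method `findImgTags` are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgTagFinder {
    // Lazy quantifier: .*? expands only as far as the first "/>" after "<img".
    private static final Pattern IMG_TAG =
            Pattern.compile("(<img)(.*?)(/>)", Pattern.CASE_INSENSITIVE);

    /** Returns every self-closing <img ... /> tag found in the input, in order. */
    public static List<String> findImgTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher matcher = IMG_TAG.matcher(html);
        while (matcher.find()) {
            tags.add(matcher.group()); // group() is the whole matched tag
        }
        return tags;
    }

    public static void main(String[] args) {
        String str = "<img src=\"static/image/smiley/comcom/9.gif\" smilieid=\"296\" border=\"0\" alt=\"\" />"
                + "hello world"
                + "<img src=\"static/image/smiley/comcom/7.gif\" smilieid=\"294\" border=\"0\" alt=\"\" />";
        // Prints each tag on its own line, two lines in total for this input.
        for (String tag : findImgTags(str)) {
            System.out.println(tag);
        }
    }
}
```

On Android, `System.out.println` can be swapped for `Log.i("TAG", tag)` as in the question.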
GrayFox374

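David M is correct, you really shouldn't try to do this, but your specific problem is that the + quantifier in your regex is greedy, so each .+ matches the longest possible substring that still lets the rest of the pattern match. Here that means a single match running from the first <img to the last />.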

See The regex tutorial for more details on the quantifiers.
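
To see the difference, here is a small sketch that counts matches for the original greedy pattern, a lazy version of it (`+?` instead of `+`), and a negated character class. The class `GreedyVsLazy`, the helper `countMatches`, and the shortened sample string are made up for illustration. The `[^>]*` variant assumes no attribute value contains a `>` character, which real HTML does not guarantee.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsLazy {
    /** Counts how many separate matches the pattern finds in the input. */
    public static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String html = "<img src=\"9.gif\" alt=\"\" />hello world<img src=\"7.gif\" alt=\"\" />";

        // Greedy: each .+ grabs as much as it can, so both tags end up in one match.
        String greedy = "(<img.+(src=\".+\").+/>){1}";
        // Lazy: each .+? grabs as little as it can, so matching stops at the first "/>".
        String lazy = "(<img.+?(src=\".+?\").+?/>)";
        // Negated class: [^>]* cannot run past the end of the current tag.
        String negated = "<img[^>]*>";

        System.out.println(countMatches(greedy, html));  // prints 1
        System.out.println(countMatches(lazy, html));    // prints 2
        System.out.println(countMatches(negated, html)); // prints 2
    }
}
```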

Ian McLaird

I would NOT recommend using regex to parse HTML. Please consider JSoup or a similar library:

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements images = doc.select("img");

Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp.

Anton