14

I am looking for a regular expression that extracts the src attribute (matched case-insensitively) of the img tag from the following HTML snippets in Java.

<html><img src="kk.gif" alt="text"/></html>
<html><img src='kk.gif' alt="text"/></html>
<html><img src = "kk.gif" alt="text"/></html>
Mnementh
Krishna Kumar

4 Answers

27

One possibility (if matched case-insensitively):

String imgRegex = "<img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>";

It's a bit of a mess, and it deliberately ignores the case where quotes aren't used. Here is the same expression without the Java string escapes:

<img[^>]+src\s*=\s*['"]([^'"]+)['"][^>]*>

This matches:

  • <img
  • one or more characters that aren't > (i.e. possible other attributes)
  • src
  • optional whitespace
  • =
  • optional whitespace
  • starting delimiter of ' or "
  • image source (which may not include a single or double quote)
  • ending delimiter
  • although the expression can stop here, I then added:
    • zero or more characters that are not > (more possible attributes)
    • > to close the tag

Things to note:

  • If you want to include the src= as well, move the open bracket further left :-)
  • This does not care about delimiter balancing or attribute values without delimiters, and it can also choke on badly-formed attributes (such as attributes that include > or image sources that include ' or ").
  • Parsing HTML with regular expressions like this is non-trivial, and at best a quick hack that works in the majority of cases.
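
As a minimal sketch of how the expression above could be used from Java: the pattern is compiled with Pattern.CASE_INSENSITIVE and the image source is read from capture group 1. The class name ImgSrcExtractor and the helper firstSrc are made up for illustration and are not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class ImgSrcExtractor {
    // Same expression as above; CASE_INSENSITIVE lets <IMG SRC=...> match too.
    static final Pattern IMG_SRC = Pattern.compile(
            "<img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Returns the src value of the first matching img tag, or null if none matches.
    static String firstSrc(String html) {
        Matcher m = IMG_SRC.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "<html><img src=\"kk.gif\" alt=\"text\"/></html>",
            "<html><img src='kk.gif' alt=\"text\"/></html>",
            "<html><img src = \"kk.gif\" alt=\"text\"/></html>"
        };
        for (String html : samples) {
            System.out.println(firstSrc(html)); // kk.gif for each sample
        }
    }
}
```

Calling m.group(0) instead would give the whole matched img tag. Unquoted values such as src=kk.gif are not matched, in line with the limitation noted above.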
DMI
  • Thanks; this returns "" match for string . Can this expression be changed to get me only "kk.txt"? Hope I am not asking for too much ;) – Krishna Kumar Jul 03 '09 at 14:19
  • The first submatch should return what you want. See http://java.sun.com/docs/books/tutorial/essential/regex/groups.html for how to access the group. You essentially want to use the `group()` method on your match result, with the argument `1`. – DMI Jul 05 '09 at 20:59
  • See the code from cletus above for an example on how to get a captured subgroup -- you just want the argument to `group()` to be `1`. – DMI Jul 05 '09 at 21:02
  • 2
    I'm so glad there exists people in this world that not only understand regular expressions much more than I, but also are nice enough to share that understanding. This regex was precisely what I needed. Thank you!!! – The Awnry Bear Oct 06 '12 at 19:43
  • I also want to get the full how to do that? – hasan May 17 '14 at 14:12
18

This question comes up a lot here.

Regular expressions are a bad way of handling this problem. Do yourself a favour and use an HTML parser of some kind.

Regexes are flaky for parsing HTML. You'll end up with a complicated expression that behaves unexpectedly in corner cases, and those corner cases will turn up sooner or later.

Edit: If your HTML is that simple then:

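import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Group 1: optional opening quote. Group 2: the value (no spaces or quotes).
Pattern p = Pattern.compile("src\\s*=\\s*([\"'])?([^ \"']*)");
Matcher m = p.matcher(str);
if (m.find()) {
  String src = m.group(2);
}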

And there are any number of Java HTML parsers out there.
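
For snippets that happen to be well-formed XML, like the three in the question, one parser-based sketch uses the XPath support that ships with the JDK (XPath is also mentioned in the comments on this answer). This is an assumption-laden example, not a general HTML solution: javax.xml.xpath is an XML tool, it rejects markup that is not well-formed, and element and attribute names are matched case-sensitively. The class name SrcViaXPath and the helper extractSrc are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
final class SrcViaXPath {
    // Returns the src of the first img element, or "" when there is none.
    // Assumes the input is well-formed XML; malformed markup is rejected.
    static String extractSrc(String xhtml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate("//img/@src",
                    new InputSource(new StringReader(xhtml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extractSrc("<html><img src='kk.gif' alt=\"text\"/></html>"));
    }
}
```

Real-world HTML is often not well-formed XML, in which case a dedicated HTML parser is the safer choice.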

cletus
  • even xpath would be better for this *sigh* – annakata Jul 03 '09 at 13:54
  • 2
    Saying that without linking to a parser is not really useful. – wds Jul 03 '09 at 13:59
  • 1
    I agree, but I have a small snippet in each data element and process the elements in a loop, so I am not sure whether loading a parser and getting the value will be viable from a performance point of view. – Krishna Kumar Jul 03 '09 at 14:04
  • 1
    @wds, saying _that_ without linking to a parser is also not useful ;). here is a list of open source java parsers: http://java-source.net/open-source/html-parsers – akf Jul 03 '09 at 14:09
  • @cletus, just FYI -- I *was* using an HTML parser because the theoretical, do-things-"The Right Way(tm)" part of me wanted to, well, do things the right way. :) Unfortunately, it turns out running an HTML parser--even a lightweight one--on dozens of HTML strings under resource-limited Android devices was found to be a bit impractical. The regex method on the other hand is extremely fast... processing times were lowered from ~30 seconds per RSS feed (with an average of 10 HTML strings to parse per feed) to ~2 seconds. Bypassing the parser using a basic XPath solution may be a good compromise. – The Awnry Bear Oct 06 '12 at 19:50
1

This answer is for people arriving from Google, since it is too late for the original asker.

Copying cletus's code as posted gave me a compile error. Changing the pattern string to src\\s*=\\s*([\"'])?([^\"']*) and passing that to Pattern.compile worked for me.

Here is the full example:

    String htmlString = "<div class=\"current\"><img src=\"img/HomePageImages/Paris.jpg\"></div>"; //Sample HTML

    String ptr= "src\\s*=\\s*([\"'])?([^\"']*)";
    Pattern p = Pattern.compile(ptr);
    Matcher m = p.matcher(htmlString);
    if (m.find()) {
        String src = m.group(2); //Result
    }
Shree Krishna
0

You mean the src attribute of the img tag? In that case you can go with the following:

<[Ii][Mm][Gg]\\s*([Ss][Rr][Cc]\\s*=\\s*[\"'].*?[\"'])

That should work. The expression src='...' is in parentheses, so it forms a capturing group and can be processed separately.
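
A short sketch of reading that group from Java, using the pattern exactly as written above. The class name SrcAttributeGroup and the helper srcAttribute are made up for illustration. Note that group 1 holds the whole attribute, including src= and the quotes, and that the pattern only matches when src is the first attribute after img.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class SrcAttributeGroup {
    // The pattern from this answer, written as a Java string literal.
    static final Pattern IMG_SRC = Pattern.compile(
            "<[Ii][Mm][Gg]\\s*([Ss][Rr][Cc]\\s*=\\s*[\"'].*?[\"'])");

    // Returns group 1 (the whole src=... attribute), or null when nothing matches.
    static String srcAttribute(String html) {
        Matcher m = IMG_SRC.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Prints src='kk.gif'
        System.out.println(srcAttribute("<html><IMG src='kk.gif' alt=\"text\"/></html>"));
        // Prints null, because alt comes before src
        System.out.println(srcAttribute("<html><img alt=\"text\" src=\"kk.gif\"/></html>"));
    }
}
```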

Mnementh
  • Yes, I need the src attribute from the image, but this expression fails to compile in Java; can you please verify it? – Krishna Kumar Jul 03 '09 at 13:52
  • 1
    That will work, until somebody uses apostrophes instead of double quotes to limit the attribute value (src='foo'). Also, your approach would fail if the img tag had other attributes. The complexity involved is fairly high, although you can get most cases right with a good regex. I don't have one handy though. – Jouni Heikniemi Jul 03 '09 at 13:55
  • 1
    Thanks for the reply; this regex fails to compile in Java with the following error: java.util.regex.PatternSyntaxException: Unclosed group near index 43 <[Ii][Mm][Gg]\s*([Ss][Rr][Cc]\s*=\s*\".*?\" ^ – Krishna Kumar Jul 03 '09 at 13:58
  • This compiles fine now, but it does not return a match for src in the following text. – Krishna Kumar Jul 03 '09 at 14:08