You say you've already extracted the <img> tag and you're working on it as a standalone string. That makes the job simpler, but there's still a great deal of complexity to deal with. For example, how would you handle this tag?
<img   foosrc="whatever" barclass=noclass src  =
folder/img.jpg class ='ho hum' ></img>
Here you've got:
- more than one space following the tag name
- attributes whose names only end with src and class
- a linefeed instead of a space after the = that follows src
- more than one space between an attribute name and the =
- single-quotes instead of double-quotes around an attribute value
- no final / because the author used an old HTML-style image tag with a closing tag, not an XML-style self-closing tag.
...and it's all just as valid as the sample tags you provided. Maybe you know you'll never have to deal with any of those issues, but we don't. If we supply you with a regex tailored to your sample data without even mentioning these other issues, are we really helping you? Or helping the others with similar problems who happen to find this page?
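To make the risk concrete, here is a minimal sketch of what can go wrong with a regex tailored only to tidy sample data. The pattern and the names NaiveSrcDemo and firstSrc are hypothetical, made up for illustration; they are not taken from the question. The pattern assumes double quotes and no whitespace around the =, and it does not check that src is the whole attribute name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NaiveSrcDemo {
    // Hypothetical naive pattern: double quotes only, no whitespace around '=',
    // and nothing to stop it matching in the middle of a longer attribute name.
    static final Pattern NAIVE = Pattern.compile("src=\"([^\"]*)\"");

    // Returns the first captured value, or null when the pattern finds nothing.
    static String firstSrc(String tag) {
        Matcher m = NAIVE.matcher(tag);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String tricky = "<img foosrc=\"whatever\" barclass=noclass src =\n"
                      + "folder/img.jpg class ='ho hum' ></img>";
        // The real src value is skipped; the match comes from inside foosrc.
        System.out.println(firstSrc(tricky)); // prints: whatever
    }
}
```

On a tag like the one above, the naive pattern doesn't merely fail: it quietly returns the value of foosrc, which is arguably worse than finding nothing.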
Here you go then:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgAttrDemo  // class name is arbitrary
{
    public static void main(String[] args)
    {
        String[] tags = { "<img src = \"the source\" class=class01 />",
                          "<img class=class02 src=folder/img02.jpg />",
                          "<img class= \"class03\" / >",
                          "<img foosrc=\"whatever\" barclass=noclass" +
                          " class='class04' src =\nfolder/img04.jpg></img>" };
        String regex =
            "(?i)\\s+(src|class)\\s*=\\s*(?:\"([^\"]+)\"|'([^']+)'|(\\S+?)(?=\\s|/?\\s*>))";
        Pattern p = Pattern.compile(regex);
        int n = 1;
        for (String tag : tags)
        {
            System.out.printf("%ntag %d: %s%n", n++, tag);
            Matcher m = p.matcher(tag);
            while (m.find())
            {
                // Only one of groups 2-4 participates in any given match;
                // start() returns -1 for a group that did not participate.
                System.out.printf("%8s: %s%n", m.group(1),
                    m.start(2) != -1 ? m.group(2) :
                    m.start(3) != -1 ? m.group(3) :
                    m.group(4));
            }
        }
    }
}
output:

tag 1: <img src = "the source" class=class01 />
     src: the source
   class: class01

tag 2: <img class=class02 src=folder/img02.jpg />
   class: class02
     src: folder/img02.jpg

tag 3: <img class= "class03" / >
   class: class03

tag 4: <img foosrc="whatever" barclass=noclass class='class04' src =
folder/img04.jpg></img>
   class: class04
     src: folder/img04.jpg
Here's a more readable form of the regex:
(?ix)                       # ignore-case and free-spacing modes
\s+                         # leading \s+ ensures we match the whole name
(src|class)                 # the attribute name is stored in group 1
\s*=\s*                     # \s* = any number of any whitespace
(?:                         # the attribute value, which may be...
    "([^"]+)"               # double-quoted (group 2)
  | '([^']+)'               # single-quoted (group 3)
  | (\S+?)(?=\s|/?\s*>)     # or not quoted (group 4)
)
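
If you'd rather keep the readable form in your source code, Java accepts it as is, provided each line of the string ends with \n so that a # comment stops at the end of its own line. What follows is only a sketch: the class name ImgAttributes, the method extract, and the choice to lower-case the attribute names and collect them in a map are my assumptions, not something the question asked for.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ImgAttributes {
    // The same regex in free-spacing mode. The embedded (?ix) flags turn on
    // case-insensitive matching and allow whitespace and # comments.
    static final Pattern ATTR = Pattern.compile(
          "(?ix)                        # ignore-case and free-spacing modes\n"
        + "\\s+ (src|class)             # group 1: the attribute name\n"
        + "\\s*=\\s*\n"
        + "(?: \"([^\"]+)\"             # group 2: double-quoted value\n"
        + "  | '([^']+)'                # group 3: single-quoted value\n"
        + "  | (\\S+?)(?=\\s|/?\\s*>)   # group 4: unquoted value\n"
        + ")");

    // Collects the src and class values of one tag, in order of appearance.
    // Names are lower-cased (an assumption) so that SRC and src share a key.
    static Map<String, String> extract(String tag) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = ATTR.matcher(tag);
        while (m.find()) {
            // Exactly one of groups 2-4 is non-null for each match.
            String value = m.group(2) != null ? m.group(2)
                         : m.group(3) != null ? m.group(3)
                         : m.group(4);
            result.put(m.group(1).toLowerCase(Locale.ROOT), value);
        }
        return result;
    }

    public static void main(String[] args) {
        String tag = "<img foosrc=\"whatever\" barclass=noclass"
                   + " class='class04' src =\nfolder/img04.jpg></img>";
        System.out.println(extract(tag)); // prints: {class=class04, src=folder/img04.jpg}
    }
}
```

A map is convenient when you only need lookups, but if an attribute appears twice the later value overwrites the earlier one. If that matters for your data, stick with the loop that prints each match.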