How to get attributes and values from badly formatted string in Java

Question

I need to get the attributes and values from multiple strings such as these:

<img src = "the source" class=class01 />
<img class=class02 src=folder/img.jpg />
<img class= "class01" / >

Values may contain spaces and slashes. Some values are enclosed in quotes and others are not, and some equals signs have spaces around them.

I'm new to this, so the code is messy and probably not foolproof.

My attempt:

//remove unnecessary spacing and "<img" and "/>"
str = str.replaceAll("/ >", "/>");
str = str.substring(4, str.length()-1);
str = str.replaceAll(" =", "=");
str = str.replaceAll("= ", "=");

//remove quotes
str = str.replaceAll("\"", "");

//creating a matcher and compiling the regex pattern is omitted, because I know how to do that using matcher.group();
regexSrc = "src=(.*?)($| class=)";
String srcString = matcherSrc.group(1);

regexClass = "class=(.*?)($| src=)";
String classString = matcherClass.group(1);

System.out.println("the source is: " + srcString);
System.out.println("the class is: " + classString);

Any suggestions on how to do this in a better way are appreciated.

You may want to use an HTML parser rather than regex: [jsoup](http://jsoup.org/) is often recommended as a good one. — MarcoS, May 05 '11 at 09:46

score 2 · Answer 1 · edited May 05 '11 at 09:48

If it is poorly formatted HTML, use JTidy to clean it up first, and then use a simpler regular expression or an HTML parser.

edited May 05 '11 at 09:48

Stephen C

answered May 05 '11 at 09:45

Andrey Adamovich

Thanks, I will take a look at it. Is it possible to do it without external files? This is not so much an HTML question, because I already have the strings; the problem is the inconsistent formatting of spacing and quotes. I was thinking of about 10-15 lines of code. – Klas Herlin May 05 '11 at 09:59
JTidy is a Java library and it has an API which you can call from your code. Look at the last section of this page: http://jtidy.sourceforge.net/howto.html – Andrey Adamovich May 05 '11 at 10:34

score 1 · Accepted Answer · answered May 05 '11 at 14:04

You say you've already extracted the <img> tag and you're working on it as a standalone string. That makes the job simpler, but there's still a great deal of complexity to deal with. For example, how would you handle this tag?

<img  foosrc="whatever" barclass=noclass src =
folder/img.jpg class   ='ho hum' ></img>

Here you've got:

more than one space following the tag name
attributes whose names only end with src and class
a linefeed instead of a space after the second =
more than one space between an attribute name and the =
single-quotes instead of double-quotes around an attribute value
no final / because the author used an old HTML-style image tag with a closing tag, not an XML-style self-closing tag.

...and it's all just as valid as the sample tags you provided. Maybe you know you'll never have to deal with any of those issues, but we don't. If we supply you with a regex tailored to your sample data without even mentioning these other issues, are we really helping you? Or helping the others with similar problems who happen to find this page?

Here you go then:

String[] tags = { "<img src = \"the source\" class=class01 />",
                  "<img class=class02 src=folder/img02.jpg />",
                  "<img class= \"class03\" / >", 
                  "<img  foosrc=\"whatever\" barclass=noclass" +
                  "    class='class04' src =\nfolder/img04.jpg></img>" };

String regex = 
  "(?i)\\s+(src|class)\\s*=\\s*(?:\"([^\"]+)\"|'([^']+)'|(\\S+?)(?=\\s|/?\\s*>))";
Pattern p = Pattern.compile(regex);
int n = 1;
for (String tag : tags)
{
  System.out.printf("%ntag %d: %s%n", n++, tag);
  Matcher m = p.matcher(tag);
  while (m.find())
  {
    System.out.printf("%8s: %s%n", m.group(1),
        m.start(2) != -1 ? m.group(2) :
        m.start(3) != -1 ? m.group(3) :
        m.group(4));
  }
}

output:

tag 1: <img src = "the source" class=class01 />
     src: the source
   class: class01

tag 2: <img class=class02 src=folder/img02.jpg />
   class: class02
     src: folder/img02.jpg

tag 3: <img class= "class03" / >
   class: class03

tag 4: <img  foosrc="whatever" barclass=noclass    class='class04' src =
folder/img04.jpg></img>
   class: class04
     src: folder/img04.jpg

Here's a more readable form of the regex:

(?ix)   # ignore-case and free-spacing modes
\s+           # leading \s+ ensures we match the whole name
(src|class)   # the attribute name is stored in group1
\s*=\s*       # \s* = any number of any whitespace
(?:           # the attribute value, which may be...
   "([^"]+)"              # double-quoted (group 2)
 | '([^']+)'              # single-quoted (group 3)
 | (\S+?)(?=\s|/?\s*>)    # or not quoted (group 4)
)
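The commented form can also be compiled directly, which keeps the explanation next to the pattern. What follows is only a sketch: the class name `ImgAttributes` and the method `parseAttributes` are made up for illustration, and it assumes `src` and `class` are the only attributes of interest. It uses `Pattern.COMMENTS` for free-spacing mode, so every commented line of the pattern string has to end in `\n`.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgAttributes {
    // Same expression as the readable form, compiled in free-spacing mode.
    // Each line ends in \n so that the '#' comment stops at the end of that line.
    private static final Pattern ATTRIBUTE = Pattern.compile(
          "\\s+                       # leading whitespace, so 'foosrc' is not taken for 'src'\n"
        + "(src|class)                # group 1: attribute name\n"
        + "\\s*=\\s*                  # equals sign with optional whitespace\n"
        + "(?:\n"
        + "   \"([^\"]+)\"            # group 2: double-quoted value\n"
        + " | '([^']+)'               # group 3: single-quoted value\n"
        + " | (\\S+?)(?=\\s|/?\\s*>)  # group 4: unquoted value\n"
        + ")",
        Pattern.COMMENTS | Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: collects the matched attributes in document order.
    static Map<String, String> parseAttributes(String tag) {
        Map<String, String> attributes = new LinkedHashMap<>();
        Matcher m = ATTRIBUTE.matcher(tag);
        while (m.find()) {
            // Exactly one of groups 2-4 participates in each match.
            String value = m.group(2) != null ? m.group(2)
                         : m.group(3) != null ? m.group(3)
                         : m.group(4);
            attributes.put(m.group(1).toLowerCase(Locale.ROOT), value);
        }
        return attributes;
    }

    public static void main(String[] args) {
        // prints {src=the source, class=class01}
        System.out.println(parseAttributes("<img src = \"the source\" class=class01 />"));
        // prints {class=ho hum, src=folder/img.jpg}
        System.out.println(parseAttributes(
            "<img  foosrc=\"whatever\" class   ='ho hum' src =\nfolder/img.jpg></img>"));
    }
}
```

Returning a map rather than printing makes it easy to ask for `src` or `class` by name; a tag that lacks one of them simply has no entry for it.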

Thanks a lot, I will try to incorporate this into my code today, see how it works out, and hopefully mark the question as solved. Yes, I should have been more precise, so here it is: a code snippet that can take care of the examples I provided will work for most, probably all, of my strings. 'src' and 'class' are the only attributes, and they are all correctly spelled. Double spacing, newlines and unclosed tags are not an issue. I'll be back with an update in a few hours. Again, thanks! – Klas Herlin, May 05 '11 at 17:16

score 0 · Answer 3 · edited May 23 '17 at 12:19

A lot of people think it is a bad idea to use regexes to parse HTML:

and top them all off ...

RegEx match open tags except XHTML self-contained tags

(though this guy seems to disagree - RegEx match open tags except XHTML self-contained tags)

edited May 23 '17 at 12:19

Community

answered May 05 '11 at 10:00

Stephen C

I've heard that before, but I'm not looking for a full-blown HTML parser, just something that removes unnecessary spacing and adds quotes where needed. Is it still a bad idea to try to do that using regex? src and class are the only attributes in the strings, which I already have in an array. – Klas Herlin May 05 '11 at 10:18
Normally I would be screaming "USE A PARSER", but in this case the formatting is so bad I wasn't sure whether even something like jTidy would fix it. – Richard H May 05 '11 at 10:19
@Klas - but what about those poor little kittens?? You cruel man! :-) – Stephen C May 05 '11 at 14:01

morja · Answer 4 · 2011-05-05T22:12:31.283

As Stephen C answered, using regex for this is generally not safe and may get you into trouble.

But here is something that might do what you need, at least for the given example:

 ([a-z]+) *= *"?((?:(?! [a-z]+ *=|/? *>|").)+)

See it on Rubular.

You may need to test it against more inputs and adjust it accordingly.

Here it is in Java code:

Pattern p = Pattern.compile("([a-z]+) *= *\"?((?:(?! [a-z]+ *=|/? *>|\").)+)", Pattern.DOTALL);
Matcher m = p.matcher(input);
while (m.find()){
    String key = m.group(1);
    String value = m.group(2);
    System.out.printf("%s:%s%n", key, value);
}
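One way to test the pattern against more inputs is to run it over the three sample tags from the question. This is only a sketch: the class name `AttributePatternCheck` and the helper `extract` are invented for illustration. An unquoted value that sits directly before `/>` is captured together with the space in front of the slash, so the helper trims each value.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AttributePatternCheck {
    // The pattern from the answer above, unchanged.
    private static final Pattern ATTRIBUTE = Pattern.compile(
        "([a-z]+) *= *\"?((?:(?! [a-z]+ *=|/? *>|\").)+)", Pattern.DOTALL);

    // Hypothetical helper: returns every name/value pair the pattern finds.
    static Map<String, String> extract(String tag) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = ATTRIBUTE.matcher(tag);
        while (m.find()) {
            // trim() drops the space left in front of a closing "/>"
            result.put(m.group(1), m.group(2).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String[] samples = {
            "<img src = \"the source\" class=class01 />",
            "<img class=class02 src=folder/img.jpg />",
            "<img class= \"class01\" / >"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + extract(sample));
        }
    }
}
```

Because the pattern accepts any lowercase name, a tag such as `<img foosrc="whatever">` produces an entry keyed `foosrc`; a caller that only cares about `src` and `class` can look those two keys up in the map and ignore the rest.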

edited May 05 '11 at 22:12

answered May 05 '11 at 11:05

morja

This is the type of regex that I was looking for, thanks. I will take a look at it and compare it with Alan Moore's longer and more foolproof code. Sorry it took a while to answer, I didn't expect this response from the community :) – Klas Herlin May 05 '11 at 17:13
I marked Alan Moore's answer as the accepted one since it was the first complete answer. Your updated answer will probably work as well. Thank you and all other people who contributed to this thread. – Klas Herlin May 06 '11 at 18:02
