I need to "grab" an attribute of a custom HTML tag. I know this sort of question has been asked many times before, but regex really messes with my head, and I can't seem to get it working.

A sample of the XML that I need to work with is:

<!-- <editable name="nameValue"> --> - content goes here - <!-- </editable> -->

I want to be able to grab the value of the name attribute, which in this case is nameValue. What I have is shown below, but it returns a null value.

My regex string (for a Java app, hence the \ to escape the ") is:
"(.)?<!-- <editable name=(\".*\")?> -->.*<!-- </editable> -->(.)?"

I am trying to grab the attribute together with its quotation marks, since I figure this is the easiest and most general pattern to match. It just doesn't work, though; any help will help me keep my hair.

Ankur

4 Answers

2

Your pattern is greedy: each .* matches as much text as it can. Use "\<\!-- \<editable name=\"(.*?)\"\> --\>.*?\<\!-- \<\/editable\> --\>" instead (note the ? added after each .*). Please note that this will not work correctly with nested <editable> elements.

If you don't need to check that the closing marker is present, you could also simply go with "\<\!-- \<editable name=\"(.*?)\"\> --\>", or even "\<editable name=\"(.*?)\"\>", which is simpler and faster.

Edit: in a Java string the backslashes must be escaped as well, so it should be:

Pattern re = Pattern.compile( "\\<editable name=\"(.*?)\"\\>" );
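
To make the effect of the added ? concrete, here is a minimal sketch that runs a lazy and a greedy version of the pattern over the same input. The class name, method name and two-block sample input are made up for this illustration; the patterns are the one above with the optional escapes dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EditableNames {
    // Lazy quantifier: (.*?) stops at the first closing quote followed by '>'.
    public static final Pattern LAZY =
            Pattern.compile("<editable name=\"(.*?)\">");
    // Greedy quantifier, for comparison: (.*) runs on to the last such
    // quote-and-bracket pair on the line.
    public static final Pattern GREEDY =
            Pattern.compile("<editable name=\"(.*)\">");

    // Collects group 1 of every match of the given pattern (hypothetical helper).
    public static List<String> names(Pattern pattern, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "<!-- <editable name=\"first\"> --> a <!-- </editable> -->"
                + "<!-- <editable name=\"second\"> --> b <!-- </editable> -->";
        System.out.println(names(LAZY, input));   // two separate names
        System.out.println(names(GREEDY, input)); // one over-long match
    }
}
```

With the lazy pattern each block yields its own name; with the greedy one, the single capture swallows everything between the first opening quote and the last closing one.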
instanceof me
  • That doesn't work either. What are the \ for in \?\> -- why would you escape the ? and > characters? – Ankur Jun 17 '09 at 08:41
  • Because those characters can be special characters in a regex. The ? is incorrect though, removed it. And actually in a Java string, I should escape the backslash as well => \\>. – instanceof me Jun 17 '09 at 08:52
  • '<', '>' and '!' don't need to be escaped. – Alan Moore Jun 17 '09 at 12:51
  • ! is used in negative look-ahead pattern and < in look-behind. Indeed, > does not need to be escaped (yet). But it doesn't harm AFAIK, so I often do it anyway. – instanceof me Jun 17 '09 at 15:32
2

I use JavaScript, but it should help to make the expression non-greedy where possible and to use a negated character class instead of matching any character. I'm not sure how similar Java's regexes are, but instead of the expression \".*\" try \"[^\"]*\". That matches only characters within the attribute value that aren't a quote, meaning the expression can't match beyond the attribute value.
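
For the Java side, a minimal sketch of the same idea might look like this. The class and method names are invented for the example, and the sample string is the one from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NameAttribute {
    // [^"]* matches any run of characters that are not a double quote,
    // so the capture cannot run past the attribute's closing quote.
    public static final Pattern NAME =
            Pattern.compile("<editable name=\"([^\"]*)\">");

    // Returns the first name attribute value, or null if there is none.
    public static String firstName(String input) {
        Matcher m = NAME.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String xml = "<!-- <editable name=\"nameValue\"> -->"
                + " - content goes here - <!-- </editable> -->";
        System.out.println(firstName(xml)); // prints nameValue
    }
}
```

Because the capturing group sits inside the quotes, group 1 holds the bare value without the quotation marks.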

Hope that helps

Alan Moore
Andy E
  • +1 for the not-quotes approach. FYI, Java regexes can do everything the JavaScript flavor can, plus a lot more. – Alan Moore Jun 17 '09 at 12:59
  • Thanks. Yeah, I know JavaScript's regexes are lacking in some areas, lookbehinds for example. Hopefully that will improve in time. – Andy E Jun 18 '09 at 08:21
2

I don't think you need the (.)?s at the beginning and end of your regex. And you need to put in a capturing group for getting only the content-goes-here bit:

This worked for me:

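import java.util.regex.Matcher;
import java.util.regex.Pattern;

String xml = "RANDOM STUFF<!-- <editable name=\"nameValue\"> --> - content goes here - <!-- </editable> -->RANDOM STUFF";
// Group 1 is the quoted name attribute; group 2 is the content between the markers.
Pattern p = Pattern.compile("<!-- <editable name=(\".*\")?> -->(.*)<!-- </editable> -->");
Matcher m = p.matcher(xml);
// find() looks for the pattern anywhere in the input, so the surrounding text is ignored.
if (m.find()) {
    System.out.println(m.group(2));
} else {
    System.out.println("no match found");
}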

This prints:

 - content goes here - 
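
If both the name and the content are wanted, possibly for several blocks in one input, the two capturing groups can be collected in a loop. The following is only a sketch under assumptions: the class and method names are invented, names are assumed to be unique, and nested editable blocks are not handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EditableBlocks {
    // Group 1: the name attribute value (no quote can appear inside it).
    // Group 2: the content, matched lazily so it ends at the nearest closing marker.
    // DOTALL lets '.' match line breaks, in case the content spans several lines.
    public static final Pattern BLOCK = Pattern.compile(
            "<!-- <editable name=\"([^\"]*)\"> -->(.*?)<!-- </editable> -->",
            Pattern.DOTALL);

    // Maps each block's name to its content, in order of appearance.
    public static Map<String, String> blocks(String input) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = BLOCK.matcher(input);
        while (m.find()) {
            result.put(m.group(1), m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "RANDOM STUFF<!-- <editable name=\"nameValue\"> -->"
                + " - content goes here - <!-- </editable> -->RANDOM STUFF";
        System.out.println(blocks(xml));
    }
}
```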
Zarkonnen
0

Regexes are fundamentally bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). What you need is an HTML parser. See "Can you provide an example of parsing HTML with your favorite parser?" for examples using a variety of parsers.

You may find the answer using TagSoup helpful.

Chas. Owens