How to change this regex to properly extract tag attributes - should be simple

Question

I need to "grab" an attribute of a custom HTML tag. I know this sort of question has been asked many times before, but regex really messes with my head, and I can't seem to get it working.

A sample of XML that I need to work with is

<!-- <editable name="nameValue"> --> - content goes here - <!-- </editable> -->

I want to be able to grab the value of the name attribute, which in this case is nameValue. What I have is shown below but this returns a null value.

My regex string (for a Java app, hence the \ to escape the ") is:
"(.)?<!-- <editable name=(\".*\")?> -->.*<!-- </editable> -->(.)?"

I am trying to grab the attribute together with its quotation marks, since I figure that is the easiest and most general pattern to match. It just doesn't work; any help will help me keep my hair.

The HTML comments are there for good reason. I don't want the browser to show the tags — Ankur, Jun 17 '09 at 08:33

instanceof me · Answer 1 · 2009-06-17T08:53:05.473

2

Your search is greedy. Use a reluctant quantifier instead: "\<\!-- \<editable name=\"(.*?)\"\> --\>.*?\<\!-- \<\/editable\> --\>" (note the added ?). Please note that this will not work correctly with nested <editable> elements.

If you don't want to perform syntax checking, you could also simply go with: "\<\!-- \<editable name=\"(.*?)\"\> --\>" or even "\<editable name=\"(.*?)\"\>" for better simplicity and performance.

Edit: in a Java string literal the backslashes themselves have to be escaped, so it should be

Pattern re = Pattern.compile( "\\<editable name=\"(.*?)\"\\>" );
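
A minimal sketch of how the reluctant pattern might be used to pull out the name value. The class name EditableName and the helper extractName are made up for illustration, the sample input is the commented-out editable markup the question describes, and the optional escapes on < and > are left out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EditableName {
    // Reluctant (.*?) stops at the first closing quote followed by '>'
    // instead of running on to the last one in the input.
    private static final Pattern NAME =
            Pattern.compile("<editable name=\"(.*?)\">");

    // Hypothetical helper: returns the name attribute of the first
    // <editable> tag, or null when no tag is found.
    static String extractName(String input) {
        Matcher m = NAME.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<!-- <editable name=\"nameValue\"> -->"
                + " - content goes here - <!-- </editable> -->";
        System.out.println(extractName(html));
    }
}
```

Because group 1 sits inside the quotes, the value comes back without the surrounding quotation marks.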

edited Jun 17 '09 at 08:53

answered Jun 17 '09 at 08:36

instanceof me


That doesn't work either. What are the \ for in \?\> -- why would you escape the ? and > characters? – Ankur Jun 17 '09 at 08:41
Because those characters can be special characters in a regex. The ? is incorrect though, removed it. And actually in a Java string, I should escape the backslash as well => \\>. – instanceof me Jun 17 '09 at 08:52
'<', '>' and '!' don't need to be escaped. – Alan Moore Jun 17 '09 at 12:51
! is used in negative look-ahead pattern and < in look-behind. Indeed, > does not need to be escaped (yet). But it doesn't harm AFAIK, so I often do it anyway. – instanceof me Jun 17 '09 at 15:32

score 2 · Answer 2 · edited Jun 17 '09 at 12:53


I use JavaScript, but it should help to make the expression non-greedy where possible and to use negated character classes instead of match-anything patterns. I'm not sure how similar Java's regexes are, but instead of \".*\" try \"[^\"]*\". That matches only characters that aren't a quote, so the expression can't match beyond the attribute value.
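
For comparison, here is a rough Java sketch of the difference between the two patterns. The class and method names are invented for the example, and the extra class attribute on the tag is a hypothetical addition used only to make the greedy match visibly overshoot.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NotQuotesDemo {
    // Greedy: .* runs to the last quote on the line, then backtracks only as needed.
    private static final Pattern GREEDY = Pattern.compile("name=\"(.*)\"");
    // Negated class: [^"]* can only consume non-quote characters,
    // so the group must end at the first closing quote.
    private static final Pattern NOT_QUOTES = Pattern.compile("name=\"([^\"]*)\"");

    static String greedyName(String input) {
        Matcher m = GREEDY.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    static String firstName(String input) {
        Matcher m = NOT_QUOTES.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The second quoted attribute is hypothetical, added for illustration.
        String tag = "<editable name=\"nameValue\" class=\"wide\">";
        System.out.println(greedyName(tag)); // spills into the next attribute
        System.out.println(firstName(tag));  // stops at the closing quote
    }
}
```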

Hope that helps

edited Jun 17 '09 at 12:53

Alan Moore


answered Jun 17 '09 at 08:37

Andy E


+1 for the not-quotes approach. FYI, Java regexes can do everything the JavaScript flavor can, plus a lot more. – Alan Moore Jun 17 '09 at 12:59
Thanks. Yeah, I know JavaScript's regexes are lacking in some areas (lookbehinds, for example). Hopefully that will improve in time. – Andy E Jun 18 '09 at 08:21

score 2 · Accepted Answer · answered Jun 17 '09 at 08:42

I don't think you need the (.)?s at the beginning and end of your regex. And you need to put in a capturing group for getting only the content-goes-here bit:

This worked for me:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String xml = "RANDOM STUFF<!-- <editable name=\"nameValue\"> --> - content goes here - <!-- </editable> -->RANDOM STUFF";
// group(1) is the quoted name attribute (optional); group(2) is the content between the markers
Pattern p = Pattern.compile("<!-- <editable name=(\".*\")?> -->(.*)<!-- </editable> -->");
Matcher m = p.matcher(xml);
if (m.find()) {
    System.out.println(m.group(2));
} else {
    System.out.println("no match found");
}

This prints:

 - content goes here -
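
Since the question asks for the name value rather than the content, here is a sketch of one way to collect both. It assumes the input may hold several non-nested editable blocks; the class name EditableBlocks, the method parse, and the second block in the sample are all invented for illustration. It swaps the greedy groups for a negated character class and a reluctant quantifier so each match stays within its own block.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EditableBlocks {
    // Group 1: attribute value without its quotes.
    // Group 2: content, matched reluctantly so it ends at the nearest closing marker.
    // DOTALL lets the content span line breaks.
    private static final Pattern BLOCK = Pattern.compile(
            "<!-- <editable name=\"([^\"]*)\"> -->(.*?)<!-- </editable> -->",
            Pattern.DOTALL);

    // Hypothetical helper: maps each name to its content, in document order.
    static Map<String, String> parse(String input) {
        Map<String, String> blocks = new LinkedHashMap<>();
        Matcher m = BLOCK.matcher(input);
        while (m.find()) {
            blocks.put(m.group(1), m.group(2));
        }
        return blocks;
    }

    public static void main(String[] args) {
        // The second block is made up to show that matches do not bleed together.
        String xml = "RANDOM STUFF<!-- <editable name=\"nameValue\"> -->"
                + " - content goes here - <!-- </editable> -->"
                + "<!-- <editable name=\"other\"> -->second<!-- </editable> -->RANDOM STUFF";
        for (Map.Entry<String, String> e : parse(xml).entrySet()) {
            System.out.println(e.getKey() + " -> [" + e.getValue() + "]");
        }
    }
}
```

Like the patterns elsewhere in this thread, this still breaks on nested editable elements.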

score 0 · Answer 4 · edited May 23 '17 at 11:47


Regexes are fundamentally bad at parsing HTML (for why, see the question "Can you provide some examples of why it is hard to parse XML and HTML with a regex?"). What you need is an HTML parser. See "Can you provide an example of parsing HTML with your favorite parser?" for examples using a variety of parsers.

You may find the answer using TagSoup helpful.

edited May 23 '17 at 11:47

Community


answered Jun 17 '09 at 13:58

Chas. Owens

