
Hello, I have a problem using a regex in Java.

I'm trying to parse this:

*whatever string*
<AttributeDesignator AttributeId="MyIDToParse"
DataType="http://www.w3.org/2001/XMLSchema#string"
Category="theCategoryIWantToParse"
MustBePresent="false"
/>
*whatever string that may contain the same regular expression*

using this code (Pattern + Matcher):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern regex = Pattern.compile("AttributeDesignator +AttributeId=\"(.+)\" +.*Category=\"(.+)", Pattern.DOTALL);
Matcher matcher = regex.matcher(xml); // xml holds the document text as a String
while (matcher.find()) {
    String id = matcher.group(1);
    String category = matcher.group(2);
}

The output is the following:

group 1:

MyIDToParse"
    DataType="http://www.w3.org/2001/XMLSchema#string"
    Category="theCategoryIWantToParse"
    MustBePresent="false"
    />
    *whatever string that may contain the same regular expression*

group 2:

theCategoryIWantToParse"
    MustBePresent="false"
    />
    *whatever string that may contain the same regular expression*

I feel like it's something simple, but I can't find what I'm doing wrong. When I tested the regex on a website, it worked correctly and highlighted the right parts of my XML input.

VLAZ
Neil
    @MartinPieters saw fit to delete my answer. But ignore the answer at your peril: any attempt to use regular expressions to parse XML will work on some input files and fail on others. That's nothing to do with your skills in writing regular expressions, it's a fundamental theory of computer science. – Michael Kay Aug 05 '15 at 07:47
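
For readers who want to follow the parser route suggested in the comment above, here is a minimal sketch using the JDK's built-in DOM parser. It assumes the input is a well-formed XML document with a single root element and that the elements are named `AttributeDesignator` without a namespace prefix; the class name `AttributeDesignatorDom` and the method `extract` are hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class AttributeDesignatorDom {
    // Returns one {AttributeId, Category} pair per AttributeDesignator element.
    // Assumption: the element name carries no namespace prefix.
    static List<String[]> extract(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("AttributeDesignator");
            List<String[]> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                // getAttribute returns an empty string when the attribute is absent.
                result.add(new String[] { e.getAttribute("AttributeId"), e.getAttribute("Category") });
            }
            return result;
        } catch (Exception ex) {
            // Malformed input ends up here; wrapped so callers need no checked exceptions.
            throw new IllegalArgumentException("Could not parse XML", ex);
        }
    }

    public static void main(String[] args) {
        String xml = "<Root>"
            + "<AttributeDesignator AttributeId=\"MyIDToParse\" Category=\"theCategoryIWantToParse\"/>"
            + "<AttributeDesignator Category=\"SecondCategory\" AttributeId=\"SecondId\"/>"
            + "</Root>";
        for (String[] pair : extract(xml)) {
            System.out.println(pair[0] + " / " + pair[1]);
        }
    }
}
```

Unlike a regex, this does not depend on attribute order, whitespace, or line breaks inside the tag.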

1 Answer


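Try using non-greedy (lazy) quantifiers. A greedy `.+` matches as much as it can, and with `Pattern.DOTALL` that includes line breaks, so each group runs far past the closing quote. Note that the pattern below also adds the closing `\"` after the second group, which your original pattern is missing.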

    Pattern regex = Pattern.compile("AttributeDesignator AttributeId=\"(.+?)\".*Category=\"(.+?)\"", Pattern.DOTALL);
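
To see the difference in isolation, here is a small sketch that runs a greedy and a lazy version of the same pattern over a shortened input. The class `GreedyVsLazy` and the helper `firstGroup` are hypothetical names used only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Returns group 1 of the first match, or null when nothing matches.
    // DOTALL is set so that '.' also matches line breaks, as in the question.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "AttributeId=\"MyIDToParse\"\nCategory=\"theCategoryIWantToParse\"";

        // Greedy: '.+' runs to the end of the input, then backtracks to the LAST quote.
        System.out.println(firstGroup("AttributeId=\"(.+)\"", input));

        // Lazy: '.+?' stops at the FIRST quote it can end on.
        System.out.println(firstGroup("AttributeId=\"(.+?)\"", input));
    }
}
```

The greedy call prints everything from `MyIDToParse` up to the last quote in the input, line break included, while the lazy call prints only `MyIDToParse`.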
Codebender
  • If you suggest a regex approach, you'd want to also use lazy matching with `.*?`. Also, `.+?` won't match empty values. – Wiktor Stribiżew Aug 04 '15 at 09:42
  • Thanks, I have an improvement, but the pattern may appear many times in one document. Right now it only parses the first occurrence. – Neil Aug 04 '15 at 14:15
  • @sasuke256, `while(matcher.find())` should ideally iterate through all the occurrences. Why do you think it matches only the first? – Codebender Aug 04 '15 at 14:23
  • I have an XML document with more than one AttributeDesignator tag, but it doesn't detect all of them. – Neil Aug 04 '15 at 14:28
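
A likely cause of the "only the first occurrence" behaviour discussed in these comments is the greedy `.*` between the two attributes: with `Pattern.DOTALL` it stretches to the last `Category="` in the whole document, so the first match swallows every later tag. The sketch below is one way to avoid that, not the answerer's exact pattern: it swaps `.*` for a lazy `[^>]*?` so a match cannot leave the current tag, and uses `[^"]*` for the values so empty values match too. It assumes `AttributeId` is the first attribute, `Category` comes after it, and no attribute value contains `>`; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AttributeDesignatorRegex {
    // [^"]*   : attribute value, may be empty, never crosses a quote
    // [^>]*?  : lazily skips other attributes, never crosses the end of the tag
    // Negated classes match line breaks, so DOTALL is not needed here.
    private static final Pattern TAG = Pattern.compile(
        "<AttributeDesignator\\s+AttributeId=\"([^\"]*)\"[^>]*?Category=\"([^\"]*)\"");

    // Returns one {AttributeId, Category} pair per matching tag.
    static List<String[]> extract(String xml) {
        List<String[]> result = new ArrayList<>();
        Matcher m = TAG.matcher(xml);
        while (m.find()) {
            result.add(new String[] { m.group(1), m.group(2) });
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "whatever\n"
            + "<AttributeDesignator AttributeId=\"MyIDToParse\"\n"
            + "    DataType=\"http://www.w3.org/2001/XMLSchema#string\"\n"
            + "    Category=\"theCategoryIWantToParse\"\n"
            + "    MustBePresent=\"false\"\n"
            + "/>\n"
            + "more text\n"
            + "<AttributeDesignator AttributeId=\"SecondId\" Category=\"SecondCategory\"/>\n";
        for (String[] pair : extract(xml)) {
            System.out.println(pair[0] + " / " + pair[1]);
        }
    }
}
```

This still breaks if the attribute order changes or a value contains `>`, which is the kind of input-dependent failure an XML parser avoids.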