I'm having a hard time using regular expressions in Java even after reading numerous tutorials online. I'm trying to extract parts of a String received to be used later in my application.

Here are examples of the possible String received:

53248 <CERCLE> 321 211 55 </CERCLE>
57346 <RECTANGLE> 272 99 289 186 </RECTANGLE>

The first number is to be extracted as a sequence number. The word between < and > is to be extracted as well, and so is the sequence of numbers between the opening and closing tags.

Here is my pattern:

"(\\d+)\\s*<(\\w+)>\\s*((\\d+\\s*)+)\\s*</\\w*>.*"

Here is the code for my method so far:

public decompose(String s) throws IllegalArgumentException {

    Pattern pattern = Pattern.compile(PATTERN);
    Matcher matcher = pattern.matcher(s);

    noSeq = Integer.parseInt(matcher.group(1));
    type = typesFormes.valueOf(matcher.group(2));
    strCoords = matcher.group(3).split(" ");

}

Problem is that when I run the code, all my matcher groups are at -1 for some reason (not found I guess). I've been banging my head on this for a while and any suggestion is welcome :) Thanks.

nl-x
JulioQc
  • I think you need to run `matcher.find()` first. I was having a problem similar to this a little while ago: http://stackoverflow.com/questions/23657575/java-regex-to-parse-any-number-of-markdown-style-links – 2rs2ts May 16 '14 at 21:59
  • More specifically, call one of `matcher.find()`, `matcher.matches()`, or `matcher.lookingAt()`. See the [Matcher](http://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html) javadoc. – ajb May 16 '14 at 22:07

3 Answers

1

Simply try with String#split()

  String str="53248 <CERCLE> 321 211 55 </CERCLE>";
  String[] array=str.split("(\\s<|>\\s)"); 
  // simple regex (space < OR > space)

Note: Use \\s+ instead of \\s if the parts may be separated by more than one space.

Use the first three elements of the array, which in this case are 53248, CERCLE, and 321 211 55.


Complete code:

String str = "53248 <CERCLE> 321 211 55 </CERCLE>";
String[] array = str.split("(\\s<|>\\s)");

int noSeq = Integer.valueOf(array[0]);
String type = array[1];
String strCoords = array[2];

System.out.println(noSeq+", "+type+", "+strCoords);

output:

53248, CERCLE, 321 211 55
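
If the coordinates are needed as numbers rather than as one string, the third element can be split again on whitespace. The following is only a sketch: the class name `CoordSplit` and the helper `parseCoords` are made up for illustration, and the split pattern is loosened to `\\s*<|>\\s*` on the assumption that the input may contain runs of spaces.

```java
import java.util.Arrays;

class CoordSplit {
    // Hypothetical helper (not from the question): turns "321 211 55" into {321, 211, 55}.
    static int[] parseCoords(String strCoords) {
        // trim() avoids an empty first token; \\s+ tolerates several spaces in a row.
        String[] tokens = strCoords.trim().split("\\s+");
        int[] coords = new int[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            // Throws NumberFormatException if a token is not an integer (or is empty).
            coords[i] = Integer.parseInt(tokens[i]);
        }
        return coords;
    }

    public static void main(String[] args) {
        String str = "53248   <CERCLE>  321 211   55 </CERCLE>";
        // Split on "<" or ">" together with any surrounding whitespace.
        String[] array = str.split("\\s*<|>\\s*");

        int noSeq = Integer.parseInt(array[0]);
        String type = array[1];
        int[] coords = parseCoords(array[2]);

        System.out.println(noSeq + ", " + type + ", " + Arrays.toString(coords));
        // prints: 53248, CERCLE, [321, 211, 55]
    }
}
```

Note that this approach does not check that the closing tag matches the opening one; it simply ignores everything after the third element.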
Braj
  • Looks valid, but I'm doing this for college and a requirement is to use regex to split the string. – JulioQc May 16 '14 at 23:36
  • What do you think about the `\\s<|>\\s` that is passed to the `split()` method? Isn't it a regex? – Braj May 17 '14 at 05:26
1

You just needed to tell the matcher to start matching the pattern against the input string. This works for me on ideone:

String s = "53248 <CERCLE> 321 211 55 </CERCLE>";
String PATTERN = "(\\d+)\\s*<(\\w+)>\\s*((\\d+\\s*)+)\\s*</\\w*>.*";
Pattern pattern = Pattern.compile(PATTERN);
Matcher matcher = pattern.matcher(s);
matcher.find();                         // aye, there's the rub
System.out.println(matcher.group(1));
System.out.println(matcher.group(2));
System.out.println(matcher.group(3));

Output was:

53248
CERCLE
321 211 55

The find() method, when successful, will let the matcher yield the information you want. From the javadocs:

If the match succeeds then more information can be obtained via the start, end, and group methods.

The javadoc for group() points the same way; note the reference to a previous match operation:

Returns the input subsequence captured by the given group during the previous match operation.
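
To make the "previous match operation" requirement concrete, here is a small sketch of how the three match methods behave with the pattern above. The class name `MatcherStateDemo`, the helper `groupReadableWithoutMatch`, and the `"id="` prefix are made up for illustration. On a standard JDK, calling `group()` before any match attempt throws `IllegalStateException`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherStateDemo {
    static final Pattern PATTERN =
            Pattern.compile("(\\d+)\\s*<(\\w+)>\\s*((\\d+\\s*)+)\\s*</\\w*>.*");

    // Hypothetical helper: reports whether group(1) can be read
    // when no match has been attempted on a fresh matcher.
    static boolean groupReadableWithoutMatch(String s) {
        Matcher m = PATTERN.matcher(s);
        try {
            m.group(1);
            return true;
        } catch (IllegalStateException e) {
            return false; // no match operation has been performed yet
        }
    }

    public static void main(String[] args) {
        String s = "53248 <CERCLE> 321 211 55 </CERCLE>";
        System.out.println(groupReadableWithoutMatch(s)); // false

        Matcher whole = PATTERN.matcher(s);
        System.out.println(whole.matches());  // true: the entire input must match
        System.out.println(whole.group(2));   // CERCLE

        Matcher prefixed = PATTERN.matcher("id=" + s);
        System.out.println(prefixed.lookingAt()); // false: must match at the start
        System.out.println(prefixed.find());      // true: scans for a match anywhere
        System.out.println(prefixed.group(1));    // 53248
    }
}
```

Which method fits depends on the input: `matches()` is the strictest, while `find()` tolerates leading text the pattern does not describe.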

2rs2ts
1

As @2rs2ts pointed out, the problem was the missing matcher.find() call.

I would further improve it like this:

final String PATTERN = "(\\d+)\\s*<(\\w+)>\\s*([\\d\\s]+)\\s*</\\2>.*";
String s = "53248 <CERCLE> 321 211 55 </CERCLE>";
Pattern pattern = Pattern.compile(PATTERN);
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
    System.out.println(matcher.group(1));
    System.out.println(matcher.group(2));
    System.out.println(matcher.group(3).trim());
}

Some improvements:

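  • In the pattern, you can simplify ((\\d+\\s*)+) as ([\\d\\s]+). For your inputs it captures the same text, although unlike the original it would also accept a run of whitespace containing no digits.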
  • In the pattern, you probably want to match <CERCLE> with a closing </CERCLE>, not </OTHER>. You can do that using \\2, which is a back reference to the 2nd capture group.
  • The boolean returned by matcher.find() tells you whether anything was matched, so check it before calling group().
  • Before you split the list of numbers in the middle, you might want to trim the possible trailing whitespace at the end using .trim().
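
Putting these points together, the method from the question could look roughly like the sketch below. Several names are assumptions: `ShapeParser` and its fields are invented for illustration, and the enum `ShapeType` merely stands in for the asker's `typesFormes`, whose real constants are not shown in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ShapeParser {
    // Assumed stand-in for the asker's typesFormes enum.
    enum ShapeType { CERCLE, RECTANGLE }

    // \\2 requires the closing tag to repeat the name captured by group 2.
    private static final Pattern PATTERN =
            Pattern.compile("(\\d+)\\s*<(\\w+)>\\s*([\\d\\s]+)\\s*</\\2>.*");

    final int noSeq;
    final ShapeType type;
    final int[] coords;

    private ShapeParser(int noSeq, ShapeType type, int[] coords) {
        this.noSeq = noSeq;
        this.type = type;
        this.coords = coords;
    }

    static ShapeParser decompose(String s) {
        Matcher matcher = PATTERN.matcher(s);
        if (!matcher.find()) {
            // Covers malformed input, including mismatched opening/closing tags.
            throw new IllegalArgumentException("Unrecognized input: " + s);
        }
        int noSeq = Integer.parseInt(matcher.group(1));
        // valueOf throws IllegalArgumentException for names that are not enum constants.
        ShapeType type = ShapeType.valueOf(matcher.group(2));
        // trim() drops the trailing whitespace that group 3 may have captured.
        String[] tokens = matcher.group(3).trim().split("\\s+");
        int[] coords = new int[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            // NumberFormatException is itself a subclass of IllegalArgumentException.
            coords[i] = Integer.parseInt(tokens[i]);
        }
        return new ShapeParser(noSeq, type, coords);
    }

    public static void main(String[] args) {
        ShapeParser p = decompose("57346 <RECTANGLE> 272 99 289 186 </RECTANGLE>");
        System.out.println(p.noSeq + " " + p.type + " " + p.coords.length);
        // prints: 57346 RECTANGLE 4
    }
}
```

With this shape, every failure mode surfaces as an `IllegalArgumentException` (or a subclass), which fits a design where the caller handles the error.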
janos
  • Adding a conditional is indeed a good idea. However, I believe it's somewhat redundant given that the method throws (the error will be handled elsewhere). As for the numbers, I do this in a different method, which kind of trims at the same time: `for (int i = 0; i < strCoords.length; i++) intCoord[i] = Integer.parseInt(strCoords[i]);` Thanks for the other advice, I will definitely use it :) – JulioQc May 16 '14 at 23:38
  • Hi @JulioQc, that point was more for the sake of illustration. If it's more appropriate in your design to handle these cases by allowing the exception to bubble up, then by all means, do it that way, it's your call. – janos May 19 '14 at 14:58