10

I have a String of the format "[(1, 2), (2, 3), (3, 4)]", with an arbitrary number of elements. I'm trying to split it on the commas separating the coordinates, that is, to retrieve (1, 2), (2, 3), and (3, 4).

Can I do it in Java regex? I'm a complete noob but hoping Java regex is powerful enough for it. If it isn't, could you suggest an alternative?

Paul Wagland
Humphrey Bogart

6 Answers

9

From Java 5

Scanner sc = new Scanner(coords); // coords is the input String
sc.useDelimiter("\\D+"); // skip everything that is not a digit
List<Coord> result = new ArrayList<Coord>();
while (sc.hasNextInt()) {
    result.add(new Coord(sc.nextInt(), sc.nextInt()));
}
return result;

EDIT: We don't know how many coordinates are passed in the string coords.
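For reference, a self-contained version of the Scanner idea might look like the sketch below. The class name CoordScannerSketch and the use of int[] pairs in place of Coord are assumptions made only for illustration. Because the delimiter is "\\D+", a minus sign is treated as a separator, so this sketch does not handle negative values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper class, not part of the original answer.
class CoordScannerSketch {

    // Reads consecutive integer pairs; every run of non-digits is a delimiter.
    static List<int[]> parse(String coords) {
        List<int[]> result = new ArrayList<int[]>();
        Scanner sc = new Scanner(coords);
        sc.useDelimiter("\\D+");
        while (sc.hasNextInt()) {
            int x = sc.nextInt();
            if (!sc.hasNextInt()) {
                break; // odd number of integers: ignore the dangling one
            }
            result.add(new int[] { x, sc.nextInt() });
        }
        sc.close();
        return result;
    }

    public static void main(String[] args) {
        for (int[] pair : parse("[(1, 2), (2, 3), (3, 4)]")) {
            System.out.println("x=" + pair[0] + ", y=" + pair[1]);
        }
    }
}
```

The extra hasNextInt() check guards against input with an odd number of integers, which the shorter loop above would answer with an exception.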

h7r
Hubert
  • Nice solution! And if you replace `Coord` with `java.awt.Point` it compiles as it is. – Fabian Steeg Feb 01 '10 at 22:23
  • 1
    Watch out for negative values! – notnoop Feb 02 '10 at 15:00
  • @notnoop : true and as strange as it seems I couldn't succeed in using a delimiter pattern like "[^-0-9]*", I had to use something less trivial like "[^0-9]*[(),]\\s*". I'm on Sun JDK6. – Hubert Feb 02 '10 at 17:01
  • I love this! However as I asked for the regex I'll chose the best regex answer as the correct one for the sake of people with a similar question ;) 1 INTERNET FOR YOU – Humphrey Bogart Feb 08 '10 at 00:20
7

You can use String#split() for this.

String string = "[(1, 2), (2, 3), (3, 4)]";
string = string.substring(1, string.length() - 1); // Get rid of the square brackets.
String[] parts = string.split("(?<=\\))(,\\s*)(?=\\()");
for (String part : parts) {
    part = part.substring(1, part.length() - 1); // Get rid of parentheses.
    String[] coords = part.split(",\\s*");
    int x = Integer.parseInt(coords[0]);
    int y = Integer.parseInt(coords[1]);
    System.out.printf("x=%d, y=%d\n", x, y);
}

The (?<=\\)) positive lookbehind means that the match must be preceded by ). The (?=\\() positive lookahead means that it must be followed by (. The (,\\s*) means that the string is split on the , and any whitespace after it. The \\ are there just to escape regex-specific chars.

That said, this particular String is recognizable as the output of List#toString(). Are you sure you're doing things the right way? ;)

Update: as per the comments, you can indeed also go the other way round and split on the non-digits:

String string = "[(1, 2), (2, 3), (3, 4)]";
String[] parts = string.split("\\D.");
for (int i = 1; i < parts.length; i += 3) {
    int x = Integer.parseInt(parts[i]);
    int y = Integer.parseInt(parts[i + 1]);
    System.out.printf("x=%d, y=%d\n", x, y);
}

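Here the \\D means that the string is split on any non-digit (\\d stands for digit). The . after it consumes one more character, so a two-character separator such as ", " counts as a single match and leaves no blank entry after the first number of a pair. The leading "[(" and the four-character run between pairs still produce blank entries, which is why the loop starts at index 1 and steps by 3. I must admit that I'm not sure how to eliminate those blank matches before the digits. I'm not a trained regex guru yet. Hey, Bart K, can you do it better?

Ultimately, though, it's better to use a parser for this. See Hubert's answer on this topic.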
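One possible way to avoid the blank entries altogether is to strip the leading non-digits first and then split on runs of non-digits. This is only a sketch under the assumption that the input contains at least one digit; the class and method names are made up for illustration.

```java
import java.util.Arrays;

// Hypothetical helper class, not part of the original answer.
class DigitSplitSketch {

    // Removes the leading non-digits, then splits on every run of non-digits.
    // String#split drops trailing empty strings, so the closing ")]" leaves no entry.
    static String[] numbers(String s) {
        return s.replaceAll("^\\D+", "").split("\\D+");
    }

    public static void main(String[] args) {
        String[] parts = numbers("[(1, 2), (2, 3), (3, 4)]");
        System.out.println(Arrays.toString(parts));
        // Pairs are now at indexes (0, 1), (2, 3), (4, 5).
        for (int i = 0; i + 1 < parts.length; i += 2) {
            int x = Integer.parseInt(parts[i]);
            int y = Integer.parseInt(parts[i + 1]);
            System.out.printf("x=%d, y=%d%n", x, y);
        }
    }
}
```

Note that for an input without any digits, split returns a single empty string rather than an empty array, so real code would need to check for that case.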

Community
BalusC
  • There are commas in the substrings as well... You can `string.split("),");`, and after this to bring back the `)`. – Yaakov Shoham Feb 01 '10 at 21:09
  • Oops, didn't notice that .. Updated answer. – BalusC Feb 01 '10 at 21:16
  • Well spotted! I'm trying to reproduce a list of coordinates from, ahem, a List effectively. – Humphrey Bogart Feb 01 '10 at 21:16
  • @Beau, and you have no reference to that List any more? It is a bit brittle to create it from the output of a `toString()` return... – Bart Kiers Feb 01 '10 at 21:23
  • @Bart If only! I'm retreiving Strings representing a series of moves from a game via a web service. Strong typing FTW! – Humphrey Bogart Feb 01 '10 at 21:26
  • @Beau, I now see what you need. I added a few more lines to get the coords out. – BalusC Feb 01 '10 at 21:34
  • Great stuff. This tempted me to mess around with Regex expressions and I came up with \\([0-9], [0-9]\\) to NOT include anything that has the form of coordinates. It would be nice to get it working with a negative lookaround as explained in this link: http://stackoverflow.com/questions/406230/regular-expression-to-match-string-not-containing-a-word – James P. Feb 01 '10 at 21:40
  • Whau, didn't know you could do *that* with a regular expression! Guess I need fetch "Mastering Regular Expressions" from the shelf and read up on this stuff :) – Jørn Schou-Rode Feb 01 '10 at 21:47
  • 2
    That being said, in the particular case of parsing coordinates, I would recommend the simpler/more comprehensible solution from my answer or the `Scanner` solution suggested by Hubert. – Jørn Schou-Rode Feb 01 '10 at 21:51
  • Yes, that kind of strings are after all indeed better to be parsed/tokenized. – BalusC Feb 01 '10 at 22:10
3

If you do not require the expression to validate the syntax around the coordinates, this should do:

\(\d+,\s\d+\)

This expression will return several matches (three with the input from your example).
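As a rough sketch of how those matches could be collected in Java (the class name PairMatchSketch and the method name pairs are assumptions for illustration only):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class PairMatchSketch {

    // Same expression as above, with backslashes doubled for a Java string literal.
    private static final Pattern PAIR = Pattern.compile("\\(\\d+,\\s\\d+\\)");

    // Returns every substring that looks like "(digits, digits)".
    static List<String> pairs(String input) {
        List<String> found = new ArrayList<String>();
        Matcher matcher = PAIR.matcher(input);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(pairs("[(1, 2), (2, 3), (3, 4)]"));
    }
}
```

Anything in the input that does not match the expression is silently skipped, which is what "does not validate the syntax" means in practice.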

In your question, you state that you want to "retrieve (1, 2), (2, 3), and (3, 4)". If you actually need the pair of values associated with each coordinate, you can drop the parentheses and modify the regex to capture the two numbers:

(\d+),\s(\d+)

The Java code will look something like this:

import java.util.regex.*;

public class Test {
    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("(\\d+),\\s(\\d+)");
        Matcher matcher = pattern.matcher("[(1, 2), (2, 3), (3, 4)]");

        while (matcher.find()) {
            int x = Integer.parseInt(matcher.group(1));
            int y = Integer.parseInt(matcher.group(2));
            System.out.printf("x=%d, y=%d\n", x, y);
        }
    }
}
Bart Kiers
Jørn Schou-Rode
1

Will there always be 3 groups of coordinates that need to be analyzed?

You could try:

\[(\(\d, \d\)), (\(\d, \d\)), (\(\d, \d\))\]
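If the number of pairs is not fixed, a quantifier can repeat the pair part of the expression. The sketch below only validates the overall shape of the string; it assumes single spaces after commas, as in the example, and the class name is hypothetical. In Java a repeated capturing group keeps only its last match, so pulling out the individual pairs still needs a separate find() loop.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class CoordListValidatorSketch {

    // "[" + optional: one pair followed by any number of ", pair" + "]"
    private static final Pattern LIST = Pattern.compile(
            "\\[(\\(\\d+, \\d+\\)(, \\(\\d+, \\d+\\))*)?\\]");

    // True only if the whole input has the expected list-of-pairs shape.
    static boolean isValid(String input) {
        return LIST.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("[(1, 2), (2, 3), (3, 4)]"));
        System.out.println(isValid("[(1, 2), (2, 3"));
    }
}
```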

dplante
FrustratedWithFormsDesigner
  • Not necessarily! I'll edit the question; cheers on the quick reply. I'm assuming some ?*+ quantifiers will do the trick from there? – Humphrey Bogart Feb 01 '10 at 21:11
1

If you use regex, you are going to get lousy error reporting, and things will get exponentially more complicated if your requirements change (for instance, if you have to parse the sets in different square brackets into different groups).

I recommend you just write the parser by hand. It's like 10 lines of code and shouldn't be very brittle. Track everything you are doing: open parens, close parens, open brackets and close brackets. It's like a switch statement with 5 options (and a default), really not that bad.

For a minimal approach, open parens and open brackets can be ignored, so there are really only 3 cases.


This would be the bare minimum.

// Java sketch: `input` is the String to parse, `output` is a hypothetical result collector
import java.util.StringTokenizer;

int valueA = 0;
String lastValue = "";
StringTokenizer tokens = new StringTokenizer(input, "[](),", true);

while (tokens.hasMoreTokens()) {
    String token = tokens.nextToken();

    // The token before the ) is the second int of the pair, and the first should
    // already be stored
    if (token.equals(")"))
        output.addResult(valueA, Integer.parseInt(lastValue.trim()));

    // The token before the comma is the first int of the pair
    else if (token.equals(","))
        valueA = Integer.parseInt(lastValue.trim());

    // Just store off this token and deal with it when we hit the proper delim
    else
        lastValue = token;
}

This is no better than a minimal regex-based solution EXCEPT that it will be MUCH easier to maintain and enhance: you can add error checking, a stack for matching parens and square brackets, and checks for misplaced commas and other invalid syntax.

As an example of expandability, if you had to place different sets of square-bracket-delimited groups into different output sets, the addition is something as simple as:

    // When we close the square bracket, start a new output group.
    else if(token.equals("]"))
        output.startNewGroup();

And checking for parens is as easy as creating a stack of chars and pushing each [ or ( onto the stack, then when you get a ] or ), pop the stack and assert that it matches. Also, when you are done, make sure your stack.size() == 0.
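The stack check described above could be sketched roughly as follows. The class and method names are hypothetical, and the sketch reports only whether the brackets match, not where a mismatch occurred.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper class, not part of the original answer.
class BracketCheckSketch {

    // Pushes every opener, pops on every closer, and verifies that they pair up.
    static boolean isBalanced(String input) {
        Deque<Character> stack = new ArrayDeque<Character>();
        for (char c : input.toCharArray()) {
            if (c == '[' || c == '(') {
                stack.push(c);
            } else if (c == ']' || c == ')') {
                if (stack.isEmpty()) {
                    return false; // closer without a matching opener
                }
                char open = stack.pop();
                if ((c == ']' && open != '[') || (c == ')' && open != '(')) {
                    return false; // wrong kind of closer
                }
            }
        }
        return stack.isEmpty(); // leftover openers mean unbalanced input
    }

    public static void main(String[] args) {
        System.out.println(isBalanced("[(1, 2), (2, 3), (3, 4)]"));
        System.out.println(isBalanced("[(1, 2]"));
    }
}
```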

Bill K
  • ...You might be on to something here... Any chance you could mock-up some code? – Humphrey Bogart Feb 01 '10 at 21:34
  • This sounds like the event-driven approach SAX uses to parse XML. I suppose you'll need to go through the text character by character and build up a series of algorithms to detect various patterns. – James P. Feb 01 '10 at 21:48
0

In regexes, you can split on the commas that follow (?<=\)), which uses a positive lookbehind:

String[] subs = str.replaceAll("\\[","").replaceAll("\\]","").split("(?<=\\)),");

With simple string functions, you can drop the [ and ], use string.split("\\),") (the ) must be escaped because split takes a regex), and put the ) back after each part.
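Since String#split always interprets its argument as a regex in Java, a variant that avoids regex entirely can use indexOf instead. This is a swapped-in technique rather than the split call mentioned above, and the class and method names are hypothetical. It assumes the input starts with [ and ends with ].

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original answer.
class PlainSplitSketch {

    // Cuts the string after every ")," and keeps the ")" with the part before it.
    static List<String> pairs(String s) {
        String inner = s.substring(1, s.length() - 1); // drop [ and ]
        List<String> result = new ArrayList<String>();
        int start = 0;
        int end;
        while ((end = inner.indexOf("),", start)) != -1) {
            result.add(inner.substring(start, end + 1).trim());
            start = end + 2; // skip past "),"
        }
        String last = inner.substring(start).trim();
        if (!last.isEmpty()) {
            result.add(last);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(pairs("[(1, 2), (2, 3), (3, 4)]"));
    }
}
```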

Yaakov Shoham
  • 1
    Your regex produces `(1`, `2), (2`, `3), (3` and `4)` on given example? – BalusC Feb 01 '10 at 21:19
  • 1
    The `"(?<=\\)),\\s*"` would be nicer as it covers spaces as well. In Java regex strings you by the way need to double-escape the \. – BalusC Feb 01 '10 at 21:25