3

Is it possible to split a String with the conditions as below?

  • split by , (i.e. comma)
  • on each element, ignore checking comma within the first ' and the last '
  • on each element, ignore checking comma within the first ( and the last )

e.g.

String source = "to_char(DATE, 'YYYY,MM,DD'), 'I am sad :(', to_char(DATE, ('YYYY(MM,DD)')), to_char(DATE, ('YYYY,MM,DD)')), to_char(DATE, ('YYYY(MM,DD')), NAME, to_char(DATE, '(YYYY)MM,DD'), CITY || ', (UK)', CITY || ', US''s CITY', CITY || ', UK'";

String[] expected = new String[]{
"to_char(DATE, 'YYYY,MM,DD')", 
"'I am sad :('",
"to_char(DATE, ('YYYY(MM,DD)'))", // brackets within quotes within brackets
"to_char(DATE, ('YYYY,MM,DD)'))", // missing open bracket in quotes
"to_char(DATE, ('YYYY(MM,DD'))", // missing close bracket in quotes
"NAME", 
"to_char(DATE, '(YYYY)MM,DD')", 
"CITY || ', (UK)'", 
"CITY || ', US''s CITY'", // escape a single quote in quotes
"CITY || ', UK'"
};

String[] result = splitElements(source);
assert expected.equals(result);

The first 2 bullet points can be achieved by Splitting on comma outside quotes when escaped quotes exist

This would be really useful when manipulating with SQL. E.g. split the items, append, insert, prepend items etc.

Thanks in advance.

Community
  • 1
  • 1
Max
  • 1,064
  • 9
  • 23

2 Answers2

6

I know it is kind of long but fairly straight-forward, just keep track of how many parens and inside or outside quotes.

String[] splitElements(String source) {
    int parencount = 0;
    boolean q = false;
    List<String> l = new ArrayList<>();
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < source.length(); i++) {
        char c = source.charAt(i);
        switch (c) {
            case ',':
                if (!q && parencount == 0) {
                    l.add(sb.toString());
                    sb.setLength(0);
                } else {
                    sb.append(c);
                }
                break;

            case '(':
                if(!q) parencount++;
                sb.append(c);
                break;

            case ')':
                if(!q) parencount--;
                sb.append(c);
                break;

            case '\'':
                q = ! q;
                sb.append(c);
                break;

            default:
                sb.append(c);
                break;
        }
    }
    String last = sb.toString();
    l.add(last);
    String sa[] = l.toArray(new String[l.size()]);
    return sa;
}
WillShackleford
  • 6,918
  • 2
  • 17
  • 33
  • 3
    Don't use `sb = new StringBuilder()` to reset the builder, just call `sb.setLength(0)`. – Andreas Sep 05 '15 at 17:43
  • This is going to have a problem, at least as far as proper SQL syntax is concerned, if you have closing parenthesis inside quotes. – RealSkeptic Sep 05 '15 at 17:45
  • @WillShackleford Thanks for the reply. The problem is, there could be any number of ( within '', some of which may not have matching close brackets. – Max Sep 05 '15 at 17:45
  • 1
    A string literal with unbalanced parenthesis characters will throw off logic. Add `if (!q)` round `parencount++` and `parencount--`. – Andreas Sep 05 '15 at 17:45
  • @WillShackleford I'll edit the input if you don't mind to cover the un-matching brackets problem. – Max Sep 05 '15 at 17:46
  • put in @Andreas suggestion ( I think) – WillShackleford Sep 05 '15 at 17:50
  • Now you are good. Final comment: No need to copy to array at the beginning. Just use `char c = source.charAt(i)`. Also: Define `l`. – Andreas Sep 05 '15 at 17:53
  • I thought I had them. At any rate they should be back now. – WillShackleford Sep 05 '15 at 17:58
  • @WillShackleford I have removed my comment about missing q and parencount after a refresh. Would you mind making the logic in a method like splitElements() with return type String[] or an array to make the solution more complete? Thanks a lot – Max Sep 05 '15 at 18:03
  • @WillShackleford I have run couple of tests. It seems to always miss the last item in the array list. – Max Sep 05 '15 at 18:22
  • @WillShackleford e.g. The result doesn't contain the last "CITY || ', UK'" – Max Sep 05 '15 at 18:29
  • @WillShackleford I added l.add(sb.toString()); before String sa[] = l.toArray(new String[l.size()]); It now works for all of my tests. Would you mind adding this line? Thanks – Max Sep 05 '15 at 18:51
  • @Ming, yep I need to add that last string. Sorry about the delay in reading your message I had to run out for something. – WillShackleford Sep 05 '15 at 19:41
  • @WillShackleford Thanks – Max Sep 06 '15 at 12:09
1

You can done this by using split method of java class String together with this

Regex:

(?<!\([^\(\)']{0,100}),(?![^\(\)']*\))(?=(?:'[^']*'|[^'])*$)

Explanation:

(?<!\([^\(\)']{0,100}),(?![^\(\)']*\))

Any , which are not surrounded by (...), note that generally negative lookbehind is required a finite quantifier here I arbitrary use {0,100}.

(?=(?:'[^']*'|[^'])*$)

Use lookahead to guarantee that either there is even number of ' which count from , till the end of the string (regex: (?:'[^']*')*$) or there is string which composed of any characters except ' till the end of the string (regex: [^']*$).

Strategy:

  1. Using negative lookahead, negative lookbehind to guarantee that , is out of (...),
  2. Using lookahead to guarantee that there are only pair of 's ahead which count from , till the end of the string.

Code Example:

public class Main{

    public static void main(String[] args) {

        String source = "to_char(DATE, 'YYYY,MM,DD'), to_char(DATE, ('YYYY(MM,DD)')), " +
                        "to_char(DATE, ('YYYY,MM,DD)')), to_char(DATE, ('YYYY(MM,DD')), " +
                        "NAME, to_char(DATE, '(YYYY)MM,DD'), CITY || ', (UK)', " +
                        "CITY || ', US''s CITY', CITY || ', UK', " +
                        "'I am sad :(', to_char(DATE, 'YYYY,MM,DD')";

        String delimiters = "(?<!\\([^\\(\\)']{0,100}),(?![^\\(\\)']*\\))(?=(?:'[^']*'|[^'])*$)";

        String[] tokens = source.split(delimiters);

        for(String token : tokens) {
            System.out.println(token.trim());
        }
    }
}

Output:

to_char(DATE, 'YYYY,MM,DD')
to_char(DATE, ('YYYY(MM,DD)'))
to_char(DATE, ('YYYY,MM,DD)'))
to_char(DATE, ('YYYY(MM,DD'))
NAME
to_char(DATE, '(YYYY)MM,DD')
CITY || ', (UK)'
CITY || ', US''s CITY'
CITY || ', UK'
'I am sad :('
to_char(DATE, 'YYYY,MM,DD')
fronthem
  • 4,011
  • 8
  • 34
  • 55
  • 1
    Thank you so much. Your answer also works for all the test cases I have. I have to say this is really clever. Thanks for also make the explanation so clear as well. – Max Sep 06 '15 at 12:12
  • Just notice this doesn't solve splitting "'I am sad :(', to_char(DATE, 'YYYY,MM,DD')", would you mind taking a look to see if we can enhance the solution? Thanks – Max Sep 07 '15 at 08:28
  • 1
    Thanks a lot. This works really well. I also updated the original post with this scenario. – Max Sep 07 '15 at 09:05
  • Yes it still does. I thought it didn't but I was wrong. Cheers – Max Sep 07 '15 at 10:23
  • @Ming In case if it doesn't work please ask him to update the answer So, later this post can help other people. – fronthem Sep 07 '15 at 10:26
  • 1
    Thanks. I have tested both of your solutions. I tried to accept both of your solutions but Stackoverflow doesn't allow me to do so :( – Max Sep 07 '15 at 10:36