4

I've seen many examples, but I am not getting the expected result.

Given a String:

"manikanta, Santhosh, ramakrishna(mani, santhosh), tester"

I would like to get the String array as follows:

manikanta,
Santhosh,
ramakrishna(mani, santhosh),
tester

I tried the following regex (got from another example):

"(\".*?\"|[^\",\\s]+)(?=\\s*,|\\s*$)"
Bohemian
Manikanta Reddy
  • I copied the regular expression from a JavaScript example, but my problem is in Java only. – Manikanta Reddy Oct 14 '15 at 07:18
  • Just ignore the example regular expression. The input string is what I have given, and I want to get the output listed in the question. – Manikanta Reddy Oct 14 '15 at 07:20
  • 2
    Good question as is, but it would be even better if you gave a minimal code example, such as a class with only a main method containing the input string, the parse/split call, and the output code, or even a JUnit test case. – hoijui Oct 14 '15 at 07:21
  • Try [`,(?![^()]*[)])`](https://regex101.com/r/eP2nD1/1) with `split()`. – Wiktor Stribiżew Oct 14 '15 at 07:22
  • Thanks @Hoijui, it is working. – Manikanta Reddy Oct 14 '15 at 07:24
  • Remove the quotes and try `[^,\(]+(\([^\)]+\))?` – Jasper de Vries Oct 14 '15 at 07:26
  • 1
    This problem probably cannot be solved by regular expressions (it depends on the exact problem). Background: depending on the context (brackets / quotes), a comma should or should not be considered for splitting. This would mean that the expressions you want to parse belong to a context-sensitive language (see the Chomsky hierarchy). However, the languages that can be detected / parsed by regular expressions are exactly the regular languages. Hence, if your expressions belong to a context-sensitive, non-regular language, you cannot parse them with regular expressions. – Sebastian Oct 14 '15 at 07:30
  • @Sebastian could it still be regular if only one level of parentheses is allowed / considered (no nesting)? – Jiri Tousek Oct 14 '15 at 07:41
  • @Sebastian http://stackoverflow.com/questions/133601/can-regular-expressions-be-used-to-match-nested-patterns – zapl Oct 14 '15 at 07:44
  • @Jiri: I'm not sure. I think so. But even then, something like "a(c,d)(e,f)" should make it non-regular. Furthermore, regular expressions with "*" or "+" can enable denial-of-service attacks: if you provide a service that uses a problematic regular expression, clients can send you input that exploits the weakness of your regex. – Sebastian Oct 14 '15 at 07:45
  • @zapl: Thanks for the hint. Okay, then it depends on the programming language whether "regular expression" means a "regular expression" in the formal-language sense. However, I'm irritated, as the comments state that a context-free language is sufficient. But what if you parse a,b\,c\\,d,e where "\" can be used to escape characters and there can be an arbitrarily long sequence of "\"? I think you need something context-sensitive. Why is context-free sufficient? – Sebastian Oct 14 '15 at 07:55
  • @Sebastian I think your example is actually even regular: A -> [a|b|c|d|e]A, A -> \B, B -> [,\]A, A -> empty. – Jiri Tousek Oct 14 '15 at 08:11
  • @Sebastian you are correct. Modern "regex libraries" go beyond "formal regular expressions", allowing recursion and some context sensitive structures. Unfortunately, Java does not implement recursion, [but it can provide a solution to some of the examples you named](http://ideone.com/6jaoeG) – Mariano Oct 14 '15 at 08:28
  • @Jiri: It should be non-regular as "b\,c\\" should be translated into "bc\". Hence you can not simply replace any "\" by empty / some specific symbol. For simplification: Try to write a regular expression which matches an arbitrarily long sequence of "\" of odd length but none of even length. – Sebastian Oct 14 '15 at 08:29
  • @Sebastian check the link in my previous comment for an arbitrary long sequence of `"\\"` – Mariano Oct 14 '15 at 08:30
  • Hello dudes, my problem is already resolved @Sebastian – Manikanta Reddy Oct 14 '15 at 08:52
  • 1
    All this talk about formal languages really doesn't help anything. The vast majority of regex users don't know what you're talking about and don't need to. And those few who *do* possess such knowledge are handicapped by it, as you've just demonstrated. :P – Alan Moore Oct 14 '15 at 11:13
  • Just to clarify: matching sequences with an even / odd number of characters is no problem (for example "(aa)*"), but if you want to treat a following character differently depending on that, it gets drastically more complex, I think. But more important: @Alan Moore is mostly right. Unfortunately, there are enough developers who try to implement things they do not completely understand, which can have a fatal impact on users (especially when it comes to security). Hence we should resist such behaviour. About my inaccuracy: sorry, I want to help but cannot afford to spend more time on that. – Sebastian Oct 15 '15 at 07:13

3 Answers

5

This does the trick:

String[] parts = input.split(", (?![^(]*\\))");

which employs a negative lookahead to assert that the next bracket char is not a close bracket, and produces:

manikanta
Santhosh
ramakrishna(mani, santhosh)
tester

The desired output as per your question keeps the trailing commas, which I assume is an oversight, but if you really do want to keep the commas:

String[] parts = input.split("(?<=,) (?![^(]*\\))");

which produces the same, but with the trailing commas intact:

manikanta,
Santhosh,
ramakrishna(mani, santhosh),
tester
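
As a minimal, runnable sketch of the two split calls above (the class name `LookaheadSplit` and the helper method names are made up for illustration and are not part of the original answer):

```java
import java.util.Arrays;

// Hypothetical demo class; only the two regular expressions come from the answer.
class LookaheadSplit {

    // Splits on ", " unless the next parenthesis after it is a closing one,
    // i.e. unless the separator sits inside a (...) group.
    static String[] dropCommas(String input) {
        return input.split(", (?![^(]*\\))");
    }

    // Same idea, but only the space is consumed: the lookbehind (?<=,)
    // requires a preceding comma without removing it.
    static String[] keepCommas(String input) {
        return input.split("(?<=,) (?![^(]*\\))");
    }

    public static void main(String[] args) {
        String input = "manikanta, Santhosh, ramakrishna(mani, santhosh), tester";
        System.out.println(Arrays.toString(dropCommas(input)));
        System.out.println(Arrays.toString(keepCommas(input)));
    }
}
```

Both patterns assume a single level of parentheses. With nested groups such as `a(b, c(d, e), f)` the lookahead stops at the inner `(`, so a comma inside the outer group would still be treated as a separator.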
Bohemian
0

Assuming we can split on whitespace (as your example suggests), you can try the regex \s+(?=([^\)]*\()|([^\)\(]*$)), like this:

String str = "manikanta, Santhosh, ramakrishna(mani, santhosh), ramakrishna(mani, santhosh), tester";
String[] ar = str.split("\\s+(?=([^\\)]*\\()|([^\\)\\(]*$))");

Where:

\s+ matches one or more whitespace characters

(?=...) is a positive lookahead: the text after the current position must match either ([^\\)]*\\() or ([^\\)\\(]*$)

([^\\)]*\\() requires an opening ( to appear before any closing ), which rules out whitespace inside a ( ... ) pair

([^\\)\\(]*$) matches when no ( or ) follows up to the end of the string; it is needed here to split off the last part, tester
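
To try the pattern end to end, here is a small runnable sketch (the class name `WhitespaceSplit` and the method name `split` are illustrative assumptions; the regex and the input string are taken from the lines above). Because only the whitespace is consumed, each part keeps its trailing comma:

```java
import java.util.Arrays;

// Hypothetical demo class wrapping the split call from this answer.
class WhitespaceSplit {

    // Splits on whitespace that is followed either by an opening "(" with no
    // ")" before it, or by no parentheses at all up to the end of the string.
    static String[] split(String str) {
        return str.split("\\s+(?=([^\\)]*\\()|([^\\)\\(]*$))");
    }

    public static void main(String[] args) {
        String str = "manikanta, Santhosh, ramakrishna(mani, santhosh), ramakrishna(mani, santhosh), tester";
        for (String part : split(str)) {
            System.out.println(part);
        }
    }
}
```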

Stanislav
-1

As I stated in my comment on the question, this problem may be impossible to solve with regular expressions.

The following code (Java) gives a hint of what to do:

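import java.util.ArrayList;
import java.util.List;

class Parser { // wrapper class added so the method compiles on its own
    static List<String> parse(String string) {
        char[] chars = string.toCharArray();
        List<String> parts = new ArrayList<String>();

        boolean split = true; // false while inside parentheses
        int lastEnd = 0;      // start index of the current part
        for (int i = 0; i < chars.length; i++) {
            char c = chars[i];
            switch (c) {
            case '(':
                split = false;
                break;
            case ')':
                split = true;
                break;
            }
            if (split && c == ',') {
                // substring's end index is exclusive, so the comma is dropped;
                // trim() removes the space that follows each comma
                parts.add(string.substring(lastEnd, i).trim());
                lastEnd = i + 1;
            }
        }
        // the text after the last comma is the final part
        parts.add(string.substring(lastEnd).trim());
        return parts;
    }
}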

Note that the code lacks some checks for constraints (provided string is null, array borders, ...).
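
As a possible extension of the same character-scanning idea, the sketch below replaces the boolean flag with a depth counter so that nested parentheses are also kept together. This goes beyond what the answer describes: the class name `DepthSplitter`, the trimming of each part, and the choice to ignore unmatched closing parentheses are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original answer.
class DepthSplitter {

    // Splits on commas that are outside all parentheses.
    static List<String> split(String input) {
        List<String> parts = new ArrayList<>();
        int depth = 0; // number of currently open parentheses
        int start = 0; // start index of the current part
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                if (depth > 0) {
                    depth--; // assumption: an unmatched ')' is ignored
                }
            } else if (c == ',' && depth == 0) {
                parts.add(input.substring(start, i).trim());
                start = i + 1;
            }
        }
        // the text after the last top-level comma is the final part
        parts.add(input.substring(start).trim());
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("manikanta, Santhosh, ramakrishna(mani, santhosh), tester"));
        System.out.println(split("a(b, c(d, e), f), g"));
    }
}
```

A counter costs no more than the flag and avoids the case where the first `)` of a nested group switches splitting back on too early.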

Sebastian