I want to split a string on underscores, but skip the first underscore when the string contains more than 4 of them. For now the input has at most 5 underscores. For the key below I need the output A_B, C, D, E, F, which the following code produces. Is there a better solution? Thanks in advance.

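String key = "A_B_C_D_E_F";
// countOccurrencesOf and tokenizeToStringArray come from Spring's org.springframework.util.StringUtils.
int occurrences = StringUtils.countOccurrencesOf(key, "_");
System.out.println(occurrences);
String[] keyValues;
if (occurrences == 5) {
    // Temporarily mask the first underscore so it is not treated as a delimiter.
    key = key.replaceFirst("_", "-");
    keyValues = StringUtils.tokenizeToStringArray(key, "_");
    // Restore the masked underscore in the first token.
    keyValues[0] = keyValues[0].replaceFirst("-", "_");
} else {
    keyValues = StringUtils.tokenizeToStringArray(key, "_");
}

for (String keyValue : keyValues) {
    System.out.println(keyValue);
}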
Termininja
Abdul

5 Answers

You can use this regex to split:

String s = "A_B_C_D_E_F";
String[] list = s.split("(?<=_[A-Z])_");

Output:

[A_B, C, D, E, F]

The idea is to match only those _ that are preceded by "_[A-Z]", which effectively skips only the first one.

If the tokens between the "_" have a different format in your strings, replace [A-Z] with the appropriate regex.
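As an illustration of swapping in a different token pattern, here is a minimal sketch that assumes each token is 1 to 20 characters long and contains no underscore. The class name, the method name and the length bound of 20 are assumptions made for this example, not part of the answer above. The bounded quantifier keeps the look-behind at a finite maximum length.

```java
import java.util.Arrays;

public class BoundedLookbehindSplit {
    // Hypothetical helper: splits on every underscore except the first one.
    // Assumption: tokens are 1 to 20 characters long and contain no underscore.
    static String[] splitSkippingFirst(String s) {
        // An underscore is a delimiter only if an underscore plus one token precedes it.
        return s.split("(?<=_[^_]{1,20})_");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitSkippingFirst("AB_CDE_F_GH")));
    }
}
```

Like the original pattern, this sketch always skips the first underscore, regardless of how many underscores the input contains.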

Maljam
  • Just as a side note, the only problem is that this doesn't work anymore once the second token has an undefined length. But this works perfectly for the example. – SomeJavaGuy Apr 08 '16 at 10:17
  • @KevinEsche well obviously you have to change `[A-Z]` to match whatever is in between the `_`, but the idea would be exactly the same – Maljam Apr 08 '16 at 10:19
  • Which will not work as you cannot use `*` or `+` in a look-behind in Java and your solution also does not respect the "skip if more than 4 are present" requirement. – Vampire Apr 08 '16 at 10:48
  • @BjörnKautler Why not? I just tested `"(?<=_+[A-Z])_+"` and it works fine when I test it. So what do you mean? – Maljam Apr 08 '16 at 10:50
  • Hm, I am actually surprised that this works, but try `"(?<=_+[A-Z]+)_+"` – Vampire Apr 08 '16 at 10:53
  • It seems if you repeat anything but the first character in a look-behind with `+` or `*` you hit this problem. – Vampire Apr 08 '16 at 10:56
  • @BjörnKautler yeah... `(?<=_+[A-Z]+)_+` does not compile.. I didn't know that. Why is that? – Maljam Apr 08 '16 at 10:57
  • As far as I remember, to check the look-behind the Java regex engine has to keep the content that came before the thing the look-behind is attached to. If this is unbounded, the regex engine would potentially have to hold all input at all times and would be too likely to run out of memory at some point, or something like that. – Vampire Apr 08 '16 at 11:00
  • Here you can read more about the limitations of look-behind in various regex libraries: http://www.regular-expressions.info/lookaround.html#limitbehind Given that description, I am still surprised that '*' and '+' work on the first character of the look-behind regex. – Vampire Apr 08 '16 at 11:06

Well, it is relatively "simple":

String str = "A_B_C_D_E_F_G";
String[] result = str.split("(?<!^[^_]*)_|_(?=(?:[^_]*_){0,3}[^_]*$)");
System.out.println(Arrays.toString(result));

Here is a version with comments for better understanding; it can also be used as is:

String str = "A_B_C_D_E_F_G";
String[] result = str.split("(?x)                  # enable embedded comments \n"
                            + "                    # first alternative splits on all but the first underscore \n"
                            + "(?<!                # next character should not be preceded by \n"
                            + "    ^[^_]*          #     only non-underscores since beginning of input \n"
                            + ")                   # so this matches only if there was an underscore before \n"
                            + "_                   # underscore \n"
                            + "|                   # alternatively split if an underscore is followed by at most three more underscores to match the less than five underscores case \n"
                            + "_                   # underscore \n"
                            + "(?=                 # preceding character must be followed by \n"
                            + "    (?:[^_]*_){0,3} #     at most three groups of non-underscores and an underscore \n"
                            + "    [^_]*$          #     only more non-underscores until end of line \n"
                            + ")");
System.out.println(Arrays.toString(result));
Vampire

You can use this regex based on \G and instead of splitting use matching:

String str = "A_B_C_D_E_F";
Pattern p = Pattern.compile("(^[^_]*_[^_]+|\\G[^_]+)(?:_|$)");
Matcher m = p.matcher(str);
List<String> resultArr = new ArrayList<>();
while (m.find()) {
    resultArr.add( m.group(1) );
}
System.err.println(resultArr);

\G asserts position at the end of the previous match or the start of the string for the first match.

Output:

[A_B, C, D, E, F]
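The pattern above joins the first two tokens for every input. To apply it only when the key has more than 4 underscores, as the question asks, one option is to guard it with a count. The following is a minimal sketch under that reading; the class name, method name and constant name are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GuardedMatchSplit {
    // Same pattern as in the answer: the first alternative captures "token_token"
    // at the start of the input, the second captures one token at a time from
    // where the previous match ended.
    private static final Pattern FIRST_PAIR_THEN_TOKENS =
            Pattern.compile("(^[^_]*_[^_]+|\\G[^_]+)(?:_|$)");

    // Hypothetical helper: plain split for up to 4 underscores, \G matching otherwise.
    static List<String> splitKey(String key) {
        long underscores = key.chars().filter(c -> c == '_').count();
        if (underscores <= 4) {
            return new ArrayList<>(Arrays.asList(key.split("_")));
        }
        List<String> result = new ArrayList<>();
        Matcher m = FIRST_PAIR_THEN_TOKENS.matcher(key);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(splitKey("A_B_C_D"));
        System.out.println(splitKey("A_B_C_D_E_F"));
    }
}
```

The sketch assumes non-empty tokens; inputs with consecutive or trailing underscores would need extra handling.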

anubhava

I would do it after the split.

public void test() {
    String key = "A_B_C_D_E_F";
    String[] parts = key.split("_");
    // More than 4 underscores means more than 5 parts.
    if (parts.length > 5) {
        String[] newParts = new String[parts.length - 1];
        // Rejoin the first two parts with the underscore that the split removed.
        newParts[0] = parts[0] + "_" + parts[1];
        // Copy the remaining parts unchanged.
        System.arraycopy(parts, 2, newParts, 1, parts.length - 2);
        parts = newParts;
    }
    System.out.println("parts = " + Arrays.toString(parts));
}
OldCurmudgeon

Although Java does not document it officially, you can use * and + in a look-behind because they are implemented as limiting quantifiers: * as {0,0x7FFFFFFF} and + as {1,0x7FFFFFFF} (see Regex look-behind without obvious maximum length in Java). So, if your strings are not too long, you can use

String key = "A_B_C_D";       // => [A, B, C, D]
//String key = "A_B_C_D_E_F"; // => [A_B, C, D, E, F]
String[] res;
// More than 4 underscores means more than 5 parts.
if (key.split("_").length > 5) {
    // Split on every underscore that is not the first one in the string.
    res = key.split("(?<!^[^_]*)_");
} else {
    res = key.split("_");
}
System.out.println(Arrays.toString(res));

DISCLAIMER: Since this is an exploit of the current Java 8 regex engine, the code may break in the future when the bug is fixed in Java.
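For readers who would rather not depend on that engine behavior, the same result can be reached without any look-behind. This is a minimal sketch, not the method from the answer above; the class and method names are made up for the example, and it reads "more than 4 occurrences" as more than 4 underscores.

```java
import java.util.Arrays;

public class IndexBasedSplit {
    // Hypothetical helper: keeps "first_second" together when the key
    // has more than 4 underscores, otherwise splits on every underscore.
    static String[] splitKey(String key) {
        long underscores = key.chars().filter(c -> c == '_').count();
        if (underscores <= 4) {
            return key.split("_");
        }
        // Locate the second underscore; everything before it is the first token.
        int second = key.indexOf('_', key.indexOf('_') + 1);
        String[] rest = key.substring(second + 1).split("_");
        String[] out = new String[rest.length + 1];
        out[0] = key.substring(0, second);
        System.arraycopy(rest, 0, out, 1, rest.length);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitKey("A_B_C_D")));
        System.out.println(Arrays.toString(splitKey("A_B_C_D_E_F")));
    }
}
```

The only regex left is the literal "_" passed to split, so nothing here relies on how the engine treats quantifiers inside a look-behind.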

Wiktor Stribiżew