94

I found a brilliant RegEx to extract the parts of a camelCase or TitleCase expression.

 (?<!^)(?=[A-Z])

It works as expected:

  • value -> value
  • camelValue -> camel / Value
  • TitleValue -> Title / Value

For example with Java:

String s = "loremIpsum";
String[] words = s.split("(?<!^)(?=[A-Z])");
// words is now {"lorem", "Ipsum"}

My problem is that it does not work in some cases:

  • Case 1: VALUE -> V / A / L / U / E
  • Case 2: eclipseRCPExt -> eclipse / R / C / P / Ext

To my mind, the result should be:

  • Case 1: VALUE
  • Case 2: eclipse / RCP / Ext

In other words, given n uppercase chars:

  • if the n chars are followed by lower case chars, the groups should be: (n-1 chars) / (n-th char + lower chars)
  • if the n chars are at the end, the group should be: (n chars).
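
The two rules above can also be written out without a regex, which makes the intended boundaries explicit. The following is only a sketch: the class and method names are made up for illustration, and it assumes that a lowercase-to-uppercase transition also starts a new group, as the original regex implies.

```java
import java.util.ArrayList;
import java.util.List;

public class CamelCaseRules {

    // Hypothetical helper: splits according to the rules stated in the question.
    public static List<String> split(String s) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 1; i < s.length(); i++) {
            char prev = s.charAt(i - 1);
            char cur = s.charAt(i);
            // Assumed rule: a lowercase letter followed by an uppercase letter is a boundary.
            boolean lowerToUpper = Character.isLowerCase(prev) && Character.isUpperCase(cur);
            // Rule 1: the last uppercase char of a run joins the lowercase chars that follow it.
            boolean lastUpperOfRun = Character.isUpperCase(prev) && Character.isUpperCase(cur)
                    && i + 1 < s.length() && Character.isLowerCase(s.charAt(i + 1));
            if (lowerToUpper || lastUpperOfRun) {
                parts.add(s.substring(start, i));
                start = i;
            }
        }
        // Rule 2: an uppercase run at the end stays in one group.
        parts.add(s.substring(start));
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("VALUE"));         // [VALUE]
        System.out.println(split("eclipseRCPExt")); // [eclipse, RCP, Ext]
    }
}
```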

Any idea on how to improve this regex?

Jmini
  • Seems that you probably would need a conditional modifier on the `^` and another conditional case for capital letters in the negative lookbehind. Haven't tested for sure, but I think that'd be your best bet for fixing the problem. – Nightfirecat Sep 29 '11 at 07:49
  • If anybody is examining – Clam Jun 10 '16 at 06:14

11 Answers

121

The following regex works for all of the above examples:

public static void main(String[] args)
{
    for (String w : "camelValue".split("(?<!(^|[A-Z]))(?=[A-Z])|(?<!^)(?=[A-Z][a-z])")) {
        System.out.println(w);
    }
}   

It works by forcing the negative lookbehind to not only ignore matches at the start of the string, but to also ignore matches where a capital letter is preceded by another capital letter. This handles cases like "VALUE".

The first part of the regex on its own fails on "eclipseRCPExt" because it does not split between "RCP" and "Ext". That is the purpose of the second clause: (?<!^)(?=[A-Z][a-z]). This clause allows a split before every capital letter that is followed by a lowercase letter, except at the start of the string.
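
To check the regex against every case from the question in one go, a small harness like the one below may help. It is a sketch only; the class name and the `split` helper are illustrative and not part of the answer.

```java
public class NpeRegexDemo {

    // The regex from this answer, unchanged.
    static final String REGEX = "(?<!(^|[A-Z]))(?=[A-Z])|(?<!^)(?=[A-Z][a-z])";

    public static String[] split(String s) {
        return s.split(REGEX);
    }

    public static void main(String[] args) {
        String[] inputs = {"value", "camelValue", "TitleValue", "VALUE", "eclipseRCPExt"};
        for (String input : inputs) {
            // Prints e.g. "eclipseRCPExt -> eclipse / RCP / Ext"
            System.out.println(input + " -> " + String.join(" / ", split(input)));
        }
    }
}
```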

NPE
  • 1
    this one does not work on PHP, while @ridgerunner's does. On PHP it says "lookbehind assertion is not fixed length at offset 13". – igorsantos07 Aug 08 '14 at 23:01
  • 15
    @Igoru: Regex flavours vary. The question is about Java, not PHP, and so is the answer. – NPE Aug 09 '14 at 11:32
  • 1
    while the question is tagged as "java" the question is still generic - besides code samples (that could never be generic). So, if there's a simpler version of this regex and that also works cross-language, I thought someone should point that :) – igorsantos07 Aug 11 '14 at 06:29
  • 7
    @Igoru: The "generic regex" is an imaginary concept. – Casimir et Hippolyte Dec 10 '14 at 19:45
  • why, @CasimiretHippolyte? Aren't regex a resource that every language can use, at some degree? I have never seen a custom implementation of regex besides this one, as usually in opensource you can share a regex between platforms. There's no such a thing as j-regex, although there are extended regex, basic regex and perl regex. Those are standard flavours, and as such Java is just being weird with their home-made version. – igorsantos07 Dec 16 '14 at 13:54
  • 3
    @igorsantos07: No, built-in regex implementations vary wildly between platforms. Some are trying to be Perl-like, some are trying to be POSIX-like, and some are something in between or completely different. – Christoffer Hammarström Mar 08 '17 at 14:29
96

It seems you are making this more complicated than it needs to be. For camelCase, the split location is simply anywhere an uppercase letter immediately follows a lowercase letter:

(?<=[a-z])(?=[A-Z])

Here is how this regex splits your example data:

  • value -> value
  • camelValue -> camel / Value
  • TitleValue -> Title / Value
  • VALUE -> VALUE
  • eclipseRCPExt -> eclipse / RCPExt

The only difference from your desired output is eclipseRCPExt, which I would argue is correctly split here.

Addendum - Improved version

Note: This answer recently got an upvote and I realized that there is a better way...

By adding a second alternative to the above regex, all of the OP's test cases are correctly split.

(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])

Here is how the improved regex splits the example data:

  • value -> value
  • camelValue -> camel / Value
  • TitleValue -> Title / Value
  • VALUE -> VALUE
  • eclipseRCPExt -> eclipse / RCP / Ext

Edit:20130824 Added improved version to handle RCPExt -> RCP / Ext case.
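
For the use case the asker mentions in the comments (turning the parts into a constant name), the improved regex could be applied in Java roughly as follows. This is a sketch under that assumption; `toConstantName` is a made-up helper name.

```java
import java.util.Locale;

public class ConstantNameDemo {

    // The improved regex from this answer, unchanged.
    static final String BOUNDARY = "(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])";

    // Hypothetical helper: split at the boundaries, join with underscores, upper-case the result.
    public static String toConstantName(String identifier) {
        return String.join("_", identifier.split(BOUNDARY)).toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(toConstantName("eclipseRCPExt")); // ECLIPSE_RCP_EXT
        System.out.println(toConstantName("camelValue"));    // CAMEL_VALUE
    }
}
```

Locale.ROOT is used so that the upper-casing does not depend on the default locale of the machine.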

ridgerunner
  • Thanks for your input. I need to separate RCP and Ext in this example, because I convert the parts into a constant name (Style guideline: "all uppercase using underscore to separate words.") In this case, I prefer ECLIPSE_RCP_EXT to ECLIPSE_RCPEXT. – Jmini Sep 29 '11 at 20:47
  • 4
    Thanks for the help; I have modified your regex to add a couple of options to care for digits in the string: `(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|(?<=[0-9])(?=[A-Z][a-z])|(?<=[a-zA-Z])(?=[0-9])` – thoroc Jan 28 '16 at 09:24
  • This is the best answer! Simple and clear. However this answer and the original RegEx by the OP do not work for Javascript & Golang! – Viet Nov 07 '16 at 16:56
  • not work for me – HalfLegend Jun 09 '23 at 13:54
38

Another solution would be to use a dedicated method in commons-lang: StringUtils#splitByCharacterTypeCamelCase

YMomb
10

I couldn't get aix's solution to work (and it doesn't work on RegExr either), so I came up with my own, which I've tested and which seems to do exactly what you're looking for:

((^[a-z]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($))))

and here's an example of using it:

; Regex Breakdown:  This will match against each word in Camel and Pascal case strings, while properly handling acronyms.
;   (^[a-z]+)                       Match against any lower-case letters at the start of the string.
;   ([A-Z]{1}[a-z]+)                Match against Title case words (one upper case followed by lower case letters).
;   ([A-Z]+(?=([A-Z][a-z])|($)))    Match against multiple consecutive upper-case letters, leaving the last upper case letter out the match if it is followed by lower case letters, and including it if it's followed by the end of the string.
newString := RegExReplace(oldCamelOrPascalString, "((^[a-z]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($))))", "$1 ")
newString := Trim(newString)

Here I'm separating each word with a space, so here are some examples of how the string is transformed:

  • ThisIsATitleCASEString => This Is A Title CASE String
  • andThisOneIsCamelCASE => and This One Is Camel CASE

This solution above does what the original post asks for, but I also needed a regex to find camel and pascal strings that included numbers, so I also came up with this variation to include numbers:

((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))

and an example of using it:

; Regex Breakdown:  This will match against each word in Camel and Pascal case strings, while properly handling acronyms and including numbers.
;   (^[a-z]+)                               Match against any lower-case letters at the start of the command.
;   ([0-9]+)                                Match against one or more consecutive numbers (anywhere in the string, including at the start).
;   ([A-Z]{1}[a-z]+)                        Match against Title case words (one upper case followed by lower case letters).
;   ([A-Z]+(?=([A-Z][a-z])|($)|([0-9])))    Match against multiple consecutive upper-case letters, leaving the last upper case letter out the match if it is followed by lower case letters, and including it if it's followed by the end of the string or a number.
newString := RegExReplace(oldCamelOrPascalString, "((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))", "$1 ")
newString := Trim(newString)

And here are some examples of how a string with numbers is transformed with this regex:

  • myVariable123 => my Variable 123
  • my2Variables => my 2 Variables
  • The3rdVariableIsHere => The 3 rdVariable Is Here
  • 12345NumsAtTheStartIncludedToo => 12345 Nums At The Start Included Too
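
Since the question is about Java, here is a rough Java port of the number-aware variant. It is a sketch, not the author's code: it uses a form of the pattern with the redundant capturing groups removed (as suggested in the comments on this answer), and the class and method names are illustrative.

```java
public class SpacedWordsDemo {

    // Number-aware pattern with a single capturing group around the whole word.
    static final String WORD = "(^[a-z]+|[0-9]+|[A-Z][a-z]+|[A-Z]+(?=[A-Z][a-z]|$|[0-9]))";

    // Appends a space after every matched word, then trims the trailing space.
    public static String spaced(String s) {
        return s.replaceAll(WORD, "$1 ").trim();
    }

    public static void main(String[] args) {
        System.out.println(spaced("ThisIsATitleCASEString")); // This Is A Title CASE String
        System.out.println(spaced("my2Variables"));           // my 2 Variables
        System.out.println(spaced("The3rdVariableIsHere"));   // The 3 rdVariable Is Here
    }
}
```

As in the examples above, lowercase letters that follow a digit ("rd") are not matched by any alternative, so no space is inserted after them.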
deadlydog
  • 1
    Too many unnecessary capturing groups. You could have written it as: `(^[a-z]+|[A-Z][a-z]+|[A-Z]+(?=[A-Z][a-z]|$))` for the first one, and `(^[a-z]+|[0-9]+|[A-Z][a-z]+|[A-Z]+(?=[A-Z][a-z]|$|[0-9]))` for the second one. The outer most can also be removed, but the syntax to refer to the whole match is not portable between languages (`$0` and `$&` are 2 possibilities). – nhahtdh Dec 11 '14 at 07:39
  • The same simplified regexp: `([A-Z]?[a-z]+)|([A-Z]+(?=[A-Z][a-z]))` – Alex Suhinin Nov 07 '19 at 15:54
6

To handle more letters than just A-Z:

s.split("(?<=\\p{Ll})(?=\\p{Lu})|(?<=\\p{L})(?=\\p{Lu}\\p{Ll})");

Either:

  • Split after any lowercase letter that is followed by an uppercase letter.

E.g. parseXML -> parse, XML.

or

  • Split after any letter that is followed by an uppercase letter and then a lowercase letter.

E.g. XMLParser -> XML, Parser.


In more readable form:

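// JUnit 4 imports are assumed here; adjust them if you use JUnit 5.
import static java.util.stream.Collectors.joining;
import static org.junit.Assert.assertEquals;

import java.util.regex.Pattern;
import org.junit.Test;

public class SplitCamelCaseTest {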

    static String BETWEEN_LOWER_AND_UPPER = "(?<=\\p{Ll})(?=\\p{Lu})";
    static String BEFORE_UPPER_AND_LOWER = "(?<=\\p{L})(?=\\p{Lu}\\p{Ll})";

    static Pattern SPLIT_CAMEL_CASE = Pattern.compile(
        BETWEEN_LOWER_AND_UPPER +"|"+ BEFORE_UPPER_AND_LOWER
    );

    public static String splitCamelCase(String s) {
        return SPLIT_CAMEL_CASE.splitAsStream(s)
                        .collect(joining(" "));
    }

    @Test
    public void testSplitCamelCase() {
        assertEquals("Camel Case", splitCamelCase("CamelCase"));
        assertEquals("lorem Ipsum", splitCamelCase("loremIpsum"));
        assertEquals("XML Parser", splitCamelCase("XMLParser"));
        assertEquals("eclipse RCP Ext", splitCamelCase("eclipseRCPExt"));
        assertEquals("VALUE", splitCamelCase("VALUE"));
    }    
}
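
To illustrate what the Unicode classes buy you, the sketch below compares an ASCII-only pattern with the \p{Ll}/\p{Lu} pattern on a word containing accented letters. The sample word and the class and method names are made up for illustration; the accented letters are written as Unicode escapes so the source file stays pure ASCII.

```java
public class UnicodeSplitDemo {

    static final String ASCII_ONLY = "(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])";
    static final String UNICODE = "(?<=\\p{Ll})(?=\\p{Lu})|(?<=\\p{L})(?=\\p{Lu}\\p{Ll})";

    public static String[] splitAscii(String s) {
        return s.split(ASCII_ONLY);
    }

    public static String[] splitUnicode(String s) {
        return s.split(UNICODE);
    }

    public static void main(String[] args) {
        // "cafe" + "Eclair", with an accented e at the end of the first word
        // and an accented capital E at the start of the second.
        String word = "caf\u00E9\u00C9clair";
        System.out.println(splitAscii(word).length);   // 1: the accented capital is not in [A-Z]
        System.out.println(splitUnicode(word).length); // 2: split between the two accented letters
    }
}
```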
Christoffer Hammarström
4

Brief

Both top answers here rely on lookbehinds, which are not supported by all regex flavours. The regex below captures the words of both PascalCase and camelCase strings without any lookbehind and can be used in multiple languages.

Note: I do realize this question is regarding Java, however, I also see multiple mentions of this post in other questions tagged for different languages, as well as some comments on this question for the same.

Code


([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\b)

Results

Sample Input

eclipseRCPExt

SomethingIsWrittenHere

TEXTIsWrittenHERE

VALUE

loremIpsum

Sample Output

eclipse
RCP
Ext

Something
Is
Written
Here

TEXT
Is
Written
HERE

VALUE

lorem
Ipsum

Explanation

  • Match one or more uppercase alpha character [A-Z]+
  • Or match zero or one uppercase alpha character [A-Z]?, followed by one or more lowercase alpha characters [a-z]+
  • Ensure what follows is an uppercase alpha character [A-Z] or a word boundary \b (a zero-width position, not a character)
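
Because this regex matches the words rather than the gaps between them, it is used with a find loop instead of split. A minimal Java sketch, with illustrative class and method names, might look like this:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordMatcherDemo {

    // The regex from this answer; the backslash of \b is doubled for the Java string literal.
    static final Pattern WORD = Pattern.compile("([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\\b)");

    // Collects every match of group 1 in order of appearance.
    public static List<String> words(String s) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(s);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("eclipseRCPExt"));     // [eclipse, RCP, Ext]
        System.out.println(words("TEXTIsWrittenHERE")); // [TEXT, Is, Written, HERE]
    }
}
```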
ctwheels
4

You can use StringUtils.splitByCharacterTypeCamelCase("loremIpsum") from Apache Commons Lang.

infomehdi
0

You can use the expression below for Java:

(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|(?=[A-Z][a-z])|(?<=\\d)(?=\\D)|(?=\\d)(?<=\\D)
  • 3
    Hi Maicon, welcome to StackOverflow and thank you for your answer. While this may answer the question, it doesn't provide any explanation for others to learn _how_ it solves the problem. Could you edit your answer to include an explanation of your code? Thank you! – Tim Malone Jul 10 '16 at 23:46
0

Instead of looking for separators that aren't there, you might also consider finding the name components (those are certainly there):

String test = "_eclipse福福RCPExt";

Pattern componentPattern = Pattern.compile("_? (\\p{Upper}?\\p{Lower}+ | (?:\\p{Upper}(?!\\p{Lower}))+ \\p{Digit}*)", Pattern.COMMENTS);

Matcher componentMatcher = componentPattern.matcher(test);
List<String> components = new LinkedList<>();
int endOfLastMatch = 0;
while (componentMatcher.find()) {
    // matches should be consecutive
    if (componentMatcher.start() != endOfLastMatch) {
        // do something horrible if you don't want garbage in between

        // we're lenient though, any Chinese characters are lucky and get through as group
        String startOrInBetween = test.substring(endOfLastMatch, componentMatcher.start());
        components.add(startOrInBetween);
    }
    components.add(componentMatcher.group(1));
    endOfLastMatch = componentMatcher.end();
}

if (endOfLastMatch != test.length()) {
    // find() has already failed here, so componentMatcher.start() would throw; take the rest of the string
    String end = test.substring(endOfLastMatch);
    components.add(end);
}

System.out.println(components);

This outputs [eclipse, 福福, RCP, Ext]. Conversion to an array is of course simple.

Maarten Bodewes
0

I can confirm that the regex string ([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\b) given by ctwheels, above, works with the Microsoft flavour of regex.

I would also like to suggest the following alternative, based on ctwheels' regex, which handles numeric characters: ([A-Z0-9]+|[A-Z]?[a-z]+)(?=[A-Z0-9]|\b).

This is able to split strings such as:

DrivingB2BTradeIn2019Onwards

to

Driving B2B Trade In 2019 Onwards
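
In Java, the digit-aware variant could be exercised along these lines. This is a sketch only: the class and method names are illustrative, and the matched words are joined with single spaces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DigitAwareWordsDemo {

    // Digit-aware variant from this answer; \b is escaped for the Java string literal.
    static final Pattern WORD = Pattern.compile("([A-Z0-9]+|[A-Z]?[a-z]+)(?=[A-Z0-9]|\\b)");

    // Finds each word and joins the words with single spaces.
    public static String spaced(String s) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(s);
        while (m.find()) {
            words.add(m.group(1));
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        // Prints: Driving B2B Trade In 2019 Onwards
        System.out.println(spaced("DrivingB2BTradeIn2019Onwards"));
    }
}
```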

William Bell
-1

A JavaScript Solution

/**
 * howToDoThis ===> ["", "how", "To", "Do", "This"]
 * @param word word to be split
 */
export const splitCamelCaseWords = (word: string) => {
    if (typeof word !== 'string') return [];
    return word.replace(/([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\b)/g, '!$&').split('!');
};
Akshay Vijay Jain
  • They ask for a JavaScript solution. And why are you giving twice the [same solution](https://stackoverflow.com/a/63127134/372239)? If you think that those questions are identical, vote to close one as duplicate. – Toto Jul 28 '20 at 08:37
  • I was curious to try this on strings containing numbers and it seems to treat it as part of the previous strings. It doesn't seem to work well on this example: `'DrivingB2BTradeIn2019Onwards'` would return `["", "DrivingB2", "B", "TradeIn2019", "Onwards"]` – kimbaudi Mar 14 '21 at 21:17