76

What I was trying to achieve was something like this:

>>> camel_case_split("CamelCaseXYZ")
['Camel', 'Case', 'XYZ']
>>> camel_case_split("XYZCamelCase")
['XYZ', 'Camel', 'Case']

So I searched and found this perfect regular expression:

(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])

As the next logical step I tried:

>>> re.split("(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", "CamelCaseXYZ")
['CamelCaseXYZ']

Why does this not work, and how do I achieve the result from the linked question in Python?

Edit: Solution summary

I tested all provided solutions with a few test cases:

string:                 ''
AplusKminus:            ['']
casimir_et_hippolyte:   []
two_hundred_success:    []
kalefranz:              string index out of range # with modification: either [] or ['']

string:                 ' '
AplusKminus:            [' ']
casimir_et_hippolyte:   []
two_hundred_success:    [' ']
kalefranz:              [' ']

string:                 'lower'
all algorithms:         ['lower']

string:                 'UPPER'
all algorithms:         ['UPPER']

string:                 'Initial'
all algorithms:         ['Initial']

string:                 'dromedaryCase'
AplusKminus:            ['dromedary', 'Case']
casimir_et_hippolyte:   ['dromedary', 'Case']
two_hundred_success:    ['dromedary', 'Case']
kalefranz:              ['Dromedary', 'Case'] # with modification: ['dromedary', 'Case']

string:                 'CamelCase'
all algorithms:         ['Camel', 'Case']

string:                 'ABCWordDEF'
AplusKminus:            ['ABC', 'Word', 'DEF']
casimir_et_hippolyte:   ['ABC', 'Word', 'DEF']
two_hundred_success:    ['ABC', 'Word', 'DEF']
kalefranz:              ['ABCWord', 'DEF']

In summary, you could say the solution by @kalefranz does not match the question (see the last case), and the solution by @casimir et hippolyte eats a single space and thereby violates the idea that a split should not change the individual parts. The only difference between the remaining two alternatives is that my solution returns a list containing the empty string for an empty input, while the solution by @200_success returns an empty list. I don't know how the Python community stands on that issue, so I say: I am fine with either one. And since 200_success's solution is simpler, I accepted it as the correct answer.

Alex Waygood
AplusKminus
  • Other Qs to do what you're trying to do: [first](http://stackoverflow.com/q/21326963/1578604), [second](http://stackoverflow.com/q/17361768/1578604) and I'm pretty sure there are others. – Jerry Apr 28 '15 at 09:57
  • How is it `ABC` CamelCase?! – mihai Apr 28 '15 at 10:48
  • 1
    @Mihai I do not understand your question. If you wonder how the regex performs on `"ABCCamelCase"`, it works as expected: `['ABC', 'Camel', 'Case']`. If you interpreted `ABC` to stand for [AbstractBaseClass](https://docs.python.org/3/library/abc.html), then I am sorry for the confusion, as `ABC` is just three arbitrary uppercase letters in my question. – AplusKminus Apr 28 '15 at 10:54
  • Read [my answer to a similar question](http://stackoverflow.com/questions/5020906#9283563). – Matthias Apr 28 '15 at 10:56
  • 1
    Also a good answer, but I did not find the question as the wording was too specific for my search. Also your answer does not quite do what is asked for here, as it produces a converted string with an arbitrary separation character which you would need to split with `str.split(' ')`, instead of a (more versatile) list of its parts. – AplusKminus Apr 28 '15 at 11:06
  • @SheridanVespo, `ABC` is just uppercase, not camel case. – mihai Apr 28 '15 at 11:52
  • Look at the questions linked. I included the upper case part to address the common wish of being able to split something like "someHTMLFile" into `['some', 'HTML', 'File']`. – AplusKminus Apr 28 '15 at 12:16

16 Answers

64

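As @AplusKminus has explained, re.split() never splits on an empty pattern match (this held for the Python versions current when this was written; per the re documentation, Python 3.7 added support for splitting on empty matches). Therefore, instead of splitting, you should try finding the components you are interested in.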

Here is a solution using re.finditer() that emulates splitting:

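from re import finditer

def camel_case_split(identifier):
    # lazily consume text up to the next case boundary (or the end of the string)
    matches = finditer(r'.+?(?:(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)', identifier)
    return [m.group(0) for m in matches]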
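As a quick check, the function can be run against some of the inputs from the question's summary. This is only a sketch: the block repeats the function so that it runs on its own, and the input list is taken from the question.

```python
from re import finditer

def camel_case_split(identifier):
    # .+? lazily consumes text until a split point (or the end) is reached
    matches = finditer(
        r'.+?(?:(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)', identifier)
    return [m.group(0) for m in matches]

# Inputs taken from the question's test table
for s in ['', ' ', 'lower', 'UPPER', 'dromedaryCase', 'CamelCase', 'ABCWordDEF']:
    print(repr(s), camel_case_split(s))
```

Because `.+?` needs at least one character, the empty string produces no match at all, which is why the result there is an empty list.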
AplusKminus
200_success
  • I found one difference (according to my test cases) between your solution and mine: `camel_case_split("")` returns `[]`in your case and `[""]` in mine. The question is, which of those you would rather consider to be expected. Since either one works in my application, I consider this to be a valid answer! – AplusKminus Apr 28 '15 at 13:05
  • Another question that remains, is whether this, or my proposed solution performs better. I am no expert on the complexity of regular expressions, so this would have to be evaluated by someone else. – AplusKminus Apr 28 '15 at 13:14
  • Our regexes are basically the same, except that mine starts with a `.+?` that captures the text instead of discarding it, and ends with a `$` to make it go all the way to the end. Neither change changes the search strategy. – 200_success Apr 28 '15 at 13:22
  • 1
    Doesn't support digits. For example, `"L2S"` is not split into `["L2", "S"]` . Use `[a-z0-9]` rather than `[a-z]` in the above regular expression to fix this. – Neapolitan Oct 06 '16 at 14:48
  • @Neapolitan The question seemed not to want a split there. – 200_success Oct 06 '16 at 14:53
  • ***Parse 1*** .+? My Doubt : What is the use of .+? here .(any character) +(one or more) ?(zero or one) This is highly level group (?:(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$) – Ravi Yadav May 16 '17 at 07:44
  • ***Parse 2*** ?: My Doubt : What is the use of ?: here ? (?:...) Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern This part contains 3 regular expression with or (?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$) (?<=[a-z])(?=[A-Z]) (?<=...) Matches if the current position in the string is preceded by a match for ... (?=[A-Z][a-z]) (?=...) Matches if ... matches next, but doesn’t consume any of the string. '$' Matches the end of the string – Ravi Yadav May 16 '17 at 07:45
  • 1
    @200_success ***Parse 1*** and ***parse 2*** are my analysis and I din't really get the regular expression . Can you help on this here ? – Ravi Yadav May 16 '17 at 07:46
51

Use re.sub() and split()

import re

name = 'CamelCaseTest123'
splitted = re.sub('([A-Z][a-z]+)', r' \1', re.sub('([A-Z]+)', r' \1', name)).split()

Result

'CamelCaseTest123' -> ['Camel', 'Case', 'Test123']
'CamelCaseXYZ' -> ['Camel', 'Case', 'XYZ']
'XYZCamelCase' -> ['XYZ', 'Camel', 'Case']
'XYZ' -> ['XYZ']
'IPAddress' -> ['IP', 'Address']
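To see what the two nested re.sub() calls do, it can help to unroll them. The sketch below uses the same two patterns; the variable names `step1` and `step2` are only for illustration.

```python
import re

name = 'XYZCamelCase'

# Pass 1: put a space before every run of uppercase letters.
step1 = re.sub(r'([A-Z]+)', r' \1', name)
# Pass 2: put a space before every capital that starts a lowercase word,
# which detaches the last letter of an uppercase run ("XYZC" + "amel").
step2 = re.sub(r'([A-Z][a-z]+)', r' \1', step1)

print(repr(step1))
print(repr(step2))
print(step2.split())  # split() without arguments ignores the surplus spaces
```

The second pass can leave doubled spaces, which is harmless only because `split()` without arguments treats any run of whitespace as one separator.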
Jossef Harush Kadouri
  • 3
    Best answer so far IMHO, elegant and effective, should be the selected answer. – Pierrick Bruneau Apr 26 '19 at 09:12
  • 3
    Nice, even just `re.sub('([A-Z]+)', r' \1', name).split()` works for simple cases when you don't have inputs like `'XYZCamelCase'` and `'IPAddress'` (or if you're ok with getting `['XYZCamel', 'Case']` and `['IPAddress']` for them). The other `re.sub` accounts for these cases too (making each sequence of lowercase letters be attached to only one preceding uppercase letter). – ShreevatsaR Apr 08 '21 at 06:40
  • @PierrickBruneau, while I agree that this answer is elegant and effective, I find it lacking in an important aspect of general Q&A-site etiquette: It does not answer the question. Well, at least not fully, since no explanation is given as to why the attempt of the question does not work. – AplusKminus Apr 19 '22 at 11:50
  • @AplusKminus, I'm answering new visitors who google "python camel case split" and land here. IMO they seek a general copy-pasteable snippet and do not have your specific issue (since they start from scratch). Therefore no need for such an explanation. This is why all of my "late" answers are like this. I'm doing this purposely. If I were answering in 2015 and targeting this answer to you, you would see such an explanation – Jossef Harush Kadouri Apr 23 '22 at 10:13
13

Most of the time, when you don't need to check the format of a string, a global search is simpler than a split (for the same result):

re.findall(r'[A-Z](?:[a-z]+|[A-Z]*(?=[A-Z]|$))', 'CamelCaseXYZ')

returns

['Camel', 'Case', 'XYZ']

To deal with dromedary too, you can use:

re.findall(r'[A-Z]?[a-z]+|[A-Z]+(?=[A-Z]|$)', 'camelCaseXYZ')

Note: (?=[A-Z]|$) can be shortened using a double negation (a negative lookahead with a negated character class): (?![^A-Z])
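A small sketch comparing the two lookahead spellings on a few inputs. The names `explicit` and `negated` are made up for this example. One caveat worth stating: `$` also matches just before a trailing newline, while `(?![^A-Z])` does not, so the two forms are only interchangeable for input without a trailing newline.

```python
import re

explicit = re.compile(r'[A-Z]?[a-z]+|[A-Z]+(?=[A-Z]|$)')
negated = re.compile(r'[A-Z]?[a-z]+|[A-Z]+(?![^A-Z])')

for s in ['camelCaseXYZ', 'XYZCamelCase', 'ABCWordDEF', 'lower', 'UPPER']:
    # both spellings should produce the same list of words
    assert explicit.findall(s) == negated.findall(s)
    print(s, negated.findall(s))
```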

Casimir et Hippolyte
  • @SheridanVespo: This is a way only for camel, not for dromedary (as asked). But it's possible to do it in the same way with few changes. – Casimir et Hippolyte Apr 28 '15 at 14:20
  • @SheridanVespo: Yes "dromedary-case" doesn't exist, but since the dromedary has only one hump, and the camel two... About efficiency: it is not the pattern itself but all the code after that you avoid since you obtain directly the list of strings you want. About lookarounds in general: lookarounds do not come straight from hell and are not so slow (they can slow down a pattern only if they are badly used). As I was saying to an other SO user there's a few minutes, there are cases where you can optimize a pattern with lookaheads. – Casimir et Hippolyte Apr 28 '15 at 15:17
  • Measured all posted solutions. Yours and `mnesarco's` one passed all of the `Setop's` tests and turned out to be the fastest. – Ledorub Nov 17 '21 at 10:45
12

Working solution, without regexp

I am not that good at regexp. I like to use them for search/replace in my IDE but I try to avoid them in programs.

Here is a quite straightforward solution in pure python:

def camel_case_split(s):
    idx = list(map(str.isupper, s))
    # mark change of case
    l = [0]
    for (i, (x, y)) in enumerate(zip(idx, idx[1:])):
        if x and not y:  # "Ul"
            l.append(i)
        elif not x and y:  # "lU"
            l.append(i+1)
    l.append(len(s))
    # for "lUl", index of "U" will pop twice, have to filter that
    return [s[x:y] for x, y in zip(l, l[1:]) if x < y]






And some tests

TESTS = [
    ("XYZCamelCase", ['XYZ', 'Camel', 'Case']),
    ("CamelCaseXYZ", ['Camel', 'Case', 'XYZ']),
    ("CamelCaseXYZa", ['Camel', 'Case', 'XY', 'Za']),
    ("XYZCamelCaseXYZ", ['XYZ', 'Camel', 'Case', 'XYZ']),
    ("aCamelCaseWordT", ['a', 'Camel', 'Case', 'Word', 'T']),
    ("CamelCaseWordT", ['Camel', 'Case', 'Word', 'T']),
    ("CamelCaseWordTa", ['Camel', 'Case', 'Word', 'Ta']),
    ("aCamelCaseWordTa", ['a', 'Camel', 'Case', 'Word', 'Ta']),
    ("Ta", ['Ta']),
    ("aT", ['a', 'T']),
    ("a", ['a']),
    ("T", ['T']),
    ("", []),
]

def test():
    for (q,a) in TESTS:
        assert camel_case_split(q) == a

if __name__ == "__main__":
    test()

Edit: a solution which streams data in one pass

This solution leverages the fact that the decision to split word or not can be taken locally, just considering the current character and the previous one.

def camel_case_split(s):
    u = True  # case of previous char
    w = b = ''  # current word, buffer for last uppercase letter
    for c in s:
        o = c.isupper()
        if u and o:
            w += b
            b = c
        elif u and not o:
            if len(w)>0:
                yield w
            w = b + c
            b = ''
        elif not u and o:
            yield w
            w = ''
            b = c
        else:  # not u and not o:
            w += c
        u = o
    if len(w)>0 or len(b)>0:  # flush
        yield w + b

It is theoretically faster and uses less memory.

The same test suite applies, but the caller must build the list:

def test():
    for (q,a) in TESTS:
        r = list(camel_case_split(q))
        print(q,a,r)
        assert r == a
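Because the function is a generator, a caller can also stop early without processing the whole input. The sketch below repeats the generator so it runs on its own and uses `itertools.islice` to take only the first two words; the sample identifier is made up.

```python
from itertools import islice

def camel_case_split(s):
    u = True  # case of previous char
    w = b = ''  # current word, buffer for last uppercase letter
    for c in s:
        o = c.isupper()
        if u and o:
            w += b
            b = c
        elif u and not o:
            if len(w) > 0:
                yield w
            w = b + c
            b = ''
        elif not u and o:
            yield w
            w = ''
            b = c
        else:  # not u and not o
            w += c
        u = o
    if len(w) > 0 or len(b) > 0:  # flush
        yield w + b

# Only the first two words are produced; the generator is never resumed
# after that, so the rest of the input is not scanned.
first_two = list(islice(camel_case_split("XYZCamelCaseAndManyMoreWords"), 2))
print(first_two)
```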


Setop
  • 2
    Thank you, this is readable, it works, and it has tests! Much better than the regexp solutions, in my opinion. – antimirov May 13 '20 at 14:41
  • Just a heads up this breaks on `World_Wide_Web` => `['World_', 'Wide_', 'Web']`. Also it breaks here `ISO100` => `['IS', 'O100']` – stwhite Oct 30 '20 at 06:51
  • @stwhite, these inputs are not considered in the original question. And if underscore and digits are considered lowercase, output is correct. So this does not break, this just does what is has to do. Other solutions may have different behaviors but again, this is not part of the initial problem. – Setop Jan 22 '21 at 08:29
6

I just stumbled upon this case and wrote a regular expression to solve it. It should work for any group of words, actually.

import re

RE_WORDS = re.compile(r'''
    # Find words in a string. Order matters!
    [A-Z]+(?=[A-Z][a-z]) |  # All upper case before a capitalized word
    [A-Z]?[a-z]+ |  # Capitalized words / all lower case
    [A-Z]+ |  # All upper case
    \d+  # Numbers
''', re.VERBOSE)

The key here is the lookahead on the first possible case. It will match (and preserve) uppercase words before capitalized ones:

assert RE_WORDS.findall('FOOBar') == ['FOO', 'Bar']
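To illustrate why the order of the alternatives matters, the sketch below compares the pattern with a deliberately misordered variant. `RE_WORDS_WRONG_ORDER` is a hypothetical name used only here, and the extra inputs come from the comment under this answer.

```python
import re

RE_WORDS = re.compile(r'''
    [A-Z]+(?=[A-Z][a-z]) |  # all upper case before a capitalized word
    [A-Z]?[a-z]+ |          # capitalized words / all lower case
    [A-Z]+ |                # all upper case
    \d+                     # numbers
''', re.VERBOSE)

# Hypothetical variant: the plain upper-case alternative is tried first,
# so it greedily swallows the capital that belongs to the next word.
RE_WORDS_WRONG_ORDER = re.compile(r'[A-Z]+|[A-Z]?[a-z]+|\d+')

for name in ['FOOBar', 'URLFinder', 'listURLReader', 'Camel5Case']:
    print(name, RE_WORDS.findall(name), RE_WORDS_WRONG_ORDER.findall(name))
```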
emyller
  • 1
    I like this one because it's clearer, and it does a better job for "strings people enter in real-life" like `URLFinder` and `listURLReader`. – Tom Swirly Jul 16 '18 at 12:47
5
import re

re.split('(?<=[a-z])(?=[A-Z])', 'camelCamelCAMEL')
# ['camel', 'Camel', 'CAMEL'] <-- result

# '(?<=[a-z])'         --> means preceding lowercase char (group A)
# '(?=[A-Z])'          --> means following UPPERCASE char (group B)
# '(group A)(group B)' --> 'aA' or 'aB' or 'bA' and so on
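This approach depends on re.split() splitting at zero-width matches, which the re documentation lists as a change in Python 3.7. Under that assumption, the full pattern from the question can be passed to re.split() directly. The version guard and the `PATTERN` name below are additions for illustration only.

```python
import re
import sys

PATTERN = r'(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])'

def camel_case_split(identifier):
    # Zero-width matches are only used as split points on Python 3.7 or newer.
    if sys.version_info < (3, 7):
        raise RuntimeError('re.split() ignores empty matches before Python 3.7')
    return re.split(PATTERN, identifier)

print(camel_case_split('CamelCaseXYZ'))
print(camel_case_split('XYZCamelCase'))
```

Like str.split(), this returns a list containing one empty string for an empty input.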
endusol
3

The documentation for Python's re.split says:

Note that split will never split a string on an empty pattern match.

When seeing this:

>>> re.findall("(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", "CamelCaseXYZ")
['', '']

it becomes clear why the split does not work as expected. The re module finds empty matches, just as intended by the regular expression.

Since the documentation states that this is not a bug, but rather intended behavior, you have to work around that when trying to create a camel case split:

from re import finditer

def camel_case_split(identifier):
    matches = finditer(r'(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])', identifier)
    split_string = []
    # index of beginning of slice
    previous = 0
    for match in matches:
        # get slice
        split_string.append(identifier[previous:match.start()])
        # advance index
        previous = match.start()
    # get remaining string
    split_string.append(identifier[previous:])
    return split_string
AplusKminus
3

This solution also supports numbers and spaces, and automatically removes underscores:

import re

def camel_terms(value):
    return re.findall('[A-Z][a-z]+|[0-9A-Z]+(?=[A-Z][a-z])|[0-9A-Z]{2,}|[a-z0-9]{2,}|[a-zA-Z0-9]', value)

Some tests:

tests = [
    "XYZCamelCase",
    "CamelCaseXYZ",
    "Camel_CaseXYZ",
    "3DCamelCase",
    "Camel5Case",
    "Camel5Case5D",
    "Camel Case XYZ"
]

for test in tests:
    print(test, "=>", camel_terms(test))

results:

XYZCamelCase => ['XYZ', 'Camel', 'Case']
CamelCaseXYZ => ['Camel', 'Case', 'XYZ']
Camel_CaseXYZ => ['Camel', 'Case', 'XYZ']
3DCamelCase => ['3D', 'Camel', 'Case']
Camel5Case => ['Camel', '5', 'Case']
Camel5Case5D => ['Camel', '5', 'Case', '5D']
Camel Case XYZ => ['Camel', 'Case', 'XYZ']
mnesarco
  • Is this regex utilizing the fact that the first matching alternative will stop the processor from looking at the others? Otherwise I don't understand `[a-z0-9]{2,}` or `[a-zA-Z0-9]`. – AplusKminus Feb 10 '21 at 16:01
  • It is because in my usecase, i need to support "3D", but also need to support "3 D" if the input is already separated with spaces or underscores. This solution comes from my own requirement which has more cases than the original question. And yes, I use the fact that first match wins. – mnesarco Feb 10 '21 at 19:25
2

Simple solution:

re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", str(text))
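Note that the expression returns a string with spaces inserted, not a list. A minimal sketch of wrapping it to get a list follows; the function name is chosen here for illustration. Since the pattern only looks for a lowercase letter or digit followed by an uppercase letter, it does not separate an uppercase run from a following capitalized word.

```python
import re

def camel_case_split(text):
    # Insert a space at every lowercase-or-digit -> uppercase boundary,
    # then split on whitespace to obtain a list of parts.
    return re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", str(text)).split()

print(camel_case_split("CamelCaseXYZ"))
print(camel_case_split("XYZCamelCase"))  # the leading uppercase run stays attached
```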
vbfh
1

Here's another solution that requires less code and no complicated regular expressions:

def camel_case_split(string):
    bldrs = [[string[0].upper()]]
    for c in string[1:]:
        if bldrs[-1][-1].islower() and c.isupper():
            bldrs.append([c])
        else:
            bldrs[-1].append(c)
    return [''.join(bldr) for bldr in bldrs]

Edit

The above code contains an optimization that avoids rebuilding the entire string with every appended character. Leaving out that optimization, a simpler version (with comments) might look like

def camel_case_split2(string):
    # set the logic for creating a "break"
    def is_transition(c1, c2):
      return c1.islower() and c2.isupper()

    # start the builder list with the first character
    # enforce upper case
    bldr = [string[0].upper()]
    for c in string[1:]:
        # get the last character in the last element in the builder
        # note that strings can be addressed just like lists
        previous_character = bldr[-1][-1]
        if is_transition(previous_character, c):
            # start a new element in the list
            bldr.append(c)
        else:
            # append the character to the last string
            bldr[-1] += c
    return bldr
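The question's summary mentions this answer "with modification", and the comments suggest removing the `.upper()` call. A sketch of such a modified version is below; the name `camel_case_split_modified` and the choice to return an empty list for empty input are assumptions made here, not part of the original answer.

```python
def camel_case_split_modified(string):
    # Assumption: an empty input yields an empty list instead of an IndexError.
    if not string:
        return []
    # Keep the first character as-is (no .upper()), so "dromedaryCase" is preserved.
    parts = [string[0]]
    for c in string[1:]:
        if parts[-1][-1].islower() and c.isupper():
            # lowercase -> uppercase transition starts a new part
            parts.append(c)
        else:
            parts[-1] += c
    return parts

print(camel_case_split_modified("dromedaryCase"))
print(camel_case_split_modified("ABCWordDEF"))
```

This still splits only on lowercase-to-uppercase transitions, so an uppercase run followed by a capitalized word stays joined, as noted in the question's last test case.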
kalefranz
  • @SheridanVespo I think the first version may have had an extraneous `)` that you caught and corrected for me :) – kalefranz Apr 28 '15 at 12:49
  • @SheridanVespo Apparently there are [varied definitions](https://en.wikipedia.org/wiki/CamelCase) for camel case. Some definitions (and the one I was originally assuming) enforce the first letter being capitalized. No worries; the "bug" is an easy fix. Just remove the `.upper()` call when initializing the list. – kalefranz Apr 28 '15 at 13:34
  • Can you create a version that satisfies the cases in the [linked answer](http://stackoverflow.com/a/7599674/1654255)? Also, is there a way to compare performance of your method and the one by @Casimir et Hippolyte? – AplusKminus Apr 28 '15 at 14:56
1

Based on @Setop's answer, I added support for numbers, whitespaces, underscores and dots:

from typing import Iterable


def _camel_case_split_iter(string: str) -> Iterable[str]:
    previous_char_upper = True
    previous_char_digit = True
    curr_word = ""
    upper_buffer = ""  # buffer for last uppercase letter
    for c in string:
        curr_char_upper = c.isupper()
        curr_char_digit = c.isdigit()
        if c.isspace() or c in ["_", "."]:
            if len(curr_word) > 0 or len(upper_buffer) > 0:
                yield curr_word + upper_buffer
                curr_word = upper_buffer = ""
        elif previous_char_upper and curr_char_upper:
            curr_word += upper_buffer
            upper_buffer = c
        elif previous_char_upper and not curr_char_upper and not curr_char_digit:
            if len(curr_word) > 0:
                yield curr_word
            curr_word = upper_buffer + c
            upper_buffer = ""
        elif not previous_char_upper and curr_char_upper:
            if len(curr_word) > 0:
                yield curr_word
                curr_word = ""
            upper_buffer = c
        elif (not previous_char_digit and curr_char_digit) or (previous_char_digit and not curr_char_digit):
            if len(curr_word) > 0 or len(upper_buffer) > 0:
                yield curr_word + upper_buffer
                upper_buffer = ""
            curr_word = c
        else:
            curr_word += c
        previous_char_upper = curr_char_upper
        previous_char_digit = curr_char_digit
    if len(curr_word) > 0 or len(upper_buffer) > 0:  # flush
        yield curr_word + upper_buffer


def camel_case_split(string: str) -> list[str]:
    """
    Split CamelCase string to words.

    >>> camel_case_split("XYZCamelCaseXYZ")
    ['XYZ', 'Camel', 'Case', 'XYZ']
    >>> camel_case_split("Ta")
    ['Ta']
    >>> camel_case_split("aT")
    ['a', 'T']
    >>> camel_case_split("_aAa_bBb__CCC__")
    ['a', 'Aa', 'b', 'Bb', 'CCC']
    >>> camel_case_split("10Camel20CaseXYZ30")
    ['10', 'Camel', '20', 'Case', 'XYZ', '30']
    >>> camel_case_split(" CamelCase camel case ")
    ['Camel', 'Case', 'camel', 'case']
    """
    return list(_camel_case_split_iter(string))

All tests:

import pytest


@pytest.mark.parametrize(
    "string,expected",
    [
        ("XYZCamelCase", ["XYZ", "Camel", "Case"]),
        ("CamelCaseXYZ", ["Camel", "Case", "XYZ"]),
        ("CamelCaseXYZa", ["Camel", "Case", "XY", "Za"]),
        ("XYZCamelCaseXYZ", ["XYZ", "Camel", "Case", "XYZ"]),
        ("aCamelCaseWordT", ["a", "Camel", "Case", "Word", "T"]),
        ("CamelCaseWordT", ["Camel", "Case", "Word", "T"]),
        ("CamelCaseWordTa", ["Camel", "Case", "Word", "Ta"]),
        ("aCamelCaseWordTa", ["a", "Camel", "Case", "Word", "Ta"]),
        ("Ta", ["Ta"]),
        ("aT", ["a", "T"]),
        ("a", ["a"]),
        ("T", ["T"]),
        ("", []),
        ("A_B", ["A", "B"]),
        ("a_b", ["a", "b"]),
        ("Camel_CaseXYZ", ["Camel", "Case", "XYZ"]),
        ("aAa_bBb", ["a", "Aa", "b", "Bb"]),
        ("aAaTTT_b", ["a", "Aa", "TTT", "b"]),
        ("__CCcCccc__DDD__eee_fGG__", ["C", "Cc", "Cccc", "DDD", "eee", "f", "GG"]),
        ("__a", ["a"]),
        ("__A", ["A"]),
        ("a__", ["a"]),
        ("A__", ["A"]),
        ("____", []),
        ("3DCamelCase", ["3", "D", "Camel", "Case"]),
        ("330DCamelCase", ["330", "D", "Camel", "Case"]),
        ("330CamelCase", ["330", "Camel", "Case"]),
        ("Camel5Case", ["Camel", "5", "Case"]),
        ("Camel50Case", ["Camel", "50", "Case"]),
        ("Camel501Case", ["Camel", "501", "Case"]),
        ("CamelCase501", ["Camel", "Case", "501"]),
        ("CamelCaseA501", ["Camel", "Case", "A", "501"]),
        ("CamelCaseAA501", ["Camel", "Case", "AA", "501"]),
        ("CamelCase501a", ["Camel", "Case", "501", "a"]),
        ("Camel5Case5D", ["Camel", "5", "Case", "5", "D"]),
        ("Camel5Case50DC", ["Camel", "5", "Case", "50", "DC"]),
        ("Camel5Case50DCCase", ["Camel", "5", "Case", "50", "DC", "Case"]),
        ("camel.case", ["camel", "case"]),
        ("Camel Case XYZ", ["Camel", "Case", "XYZ"]),
        (" Camel Case 1 3XYZ _ AA ", ["Camel", "Case", "1", "3", "XYZ", "AA"]),
        ("camel\ncase", ["camel", "case"]),
    ],
)
def test_camel_case_split(string, expected):
    res = camel_case_split(string)
    assert res == expected

But I believe @mnesarco's answer is also very good: it is about 5x faster and behaves almost the same.

The only difference (that I know of) is how numbers next to uppercase letters are handled:

"3DAndD3ARESoComplicated" -> 
# My answer:
['3', 'D', 'And', 'D', '3', 'ARE', 'So', 'Complicated'] 
# mnesarco's answer:
['3D', 'And', 'D3ARE', 'So', 'Complicated'] 
Noam Nol
0

I know that the question is tagged regex. Still, I always try to stay as far away from regex as possible. So, here is my solution without regex:

def split_camel(text, char):
    if len(text) <= 1: # To avoid adding a wrong space in the beginning
        return text+char
    if char.isupper() and text[-1].islower(): # Regular Camel case
        return text + " " + char
    elif text[-1].isupper() and char.islower() and text[-2] != " ": # Detect Camel case in case of abbreviations
        return text[:-1] + " " + text[-1] + char
    else: # Do nothing part
        return text + char

from functools import reduce  # needed on Python 3; reduce is a builtin on Python 2

text = "PathURLFinder"
text = reduce(split_camel, text, "")
print(text)
# prints "Path URL Finder"
print(text.split(" "))
# prints "['Path', 'URL', 'Finder']"

EDIT: As suggested, here is the code to put the functionality in a single function.

from functools import reduce  # needed on Python 3; reduce is a builtin on Python 2

def split_camel(text):
    def splitter(text, char):
        if len(text) <= 1: # To avoid adding a wrong space in the beginning
            return text+char
        if char.isupper() and text[-1].islower(): # Regular Camel case
            return text + " " + char
        elif text[-1].isupper() and char.islower() and text[-2] != " ": # Detect Camel case in case of abbreviations
            return text[:-1] + " " + text[-1] + char
        else: # Do nothing part
            return text + char
    converted_text = reduce(splitter, text, "")
    return converted_text.split(" ")

split_camel("PathURLFinder")
# returns ['Path', 'URL', 'Finder']
thiruvenkadam
0

Putting a more comprehensive approach out there. It takes care of several issues like numbers, strings starting with lower case, single-letter words, etc.

import re

def camel_case_split(identifier, remove_single_letter_words=False):
    """Parses CamelCase and Snake naming"""
    concat_words = re.split('[^a-zA-Z]+', identifier)

    def camel_case_split(string):
        bldrs = [[string[0].upper()]]
        string = string[1:]
        for idx, c in enumerate(string):
            if bldrs[-1][-1].islower() and c.isupper():
                bldrs.append([c])
            elif c.isupper() and (idx+1) < len(string) and string[idx+1].islower():
                bldrs.append([c])
            else:
                bldrs[-1].append(c)

        words = [''.join(bldr) for bldr in bldrs]
        words = [word.lower() for word in words]
        return words
    words = []
    for word in concat_words:
        if len(word) > 0:
            words.extend(camel_case_split(word))
    if remove_single_letter_words:
        subset_words = []
        for word in words:
            if len(word) > 1:
                subset_words.append(word)
        if len(subset_words) > 0:
            words = subset_words
    return words
datarpit
  • Could you add more comments to the code, so a person not well-versed in python will have it easier to understand what it does? – AplusKminus Sep 02 '19 at 08:48
0

My requirement was a bit more specific than the OP. In particular, in addition to handling all OP cases, I needed the following which the other solutions do not provide:

- treat all non-alphanumeric input (e.g. !@#$%^&*() etc) as a word separator
- handle digits as follows:
  - cannot be in the middle of a word
  - cannot be at the beginning of the word unless the phrase starts with a digit

import re

def splitWords(s):
    new_s = re.sub(r'[^a-zA-Z0-9]', ' ',                  # not alphanumeric
        re.sub(r'([0-9]+)([^0-9])', '\\1 \\2',            # digit followed by non-digit
            re.sub(r'([a-z])([A-Z])','\\1 \\2',           # lower case followed by upper case
                re.sub(r'([A-Z])([A-Z][a-z])', '\\1 \\2', # upper case followed by upper case followed by lower case
                    s
                )
            )
        )
    )
    return [x for x in new_s.split(' ') if x]

Output:

for test in ['', ' ', 'lower', 'UPPER', 'Initial', 'dromedaryCase', 'CamelCase', 'ABCWordDEF', 'CamelCaseXYZand123.how23^ar23e you doing AndABC123XYZdf']:
    print(test + ':' + str(splitWords(test)))
:[]
 :[]
lower:['lower']
UPPER:['UPPER']
Initial:['Initial']
dromedaryCase:['dromedary', 'Case']
CamelCase:['Camel', 'Case']
ABCWordDEF:['ABC', 'Word', 'DEF']
CamelCaseXYZand123.how23^ar23e you doing AndABC123XYZdf:['Camel', 'Case', 'XY', 'Zand123', 'how23', 'ar23', 'e', 'you', 'doing', 'And', 'ABC123', 'XY', 'Zdf']
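The nested re.sub() calls run from the innermost outward. Unrolling them shows what each pass contributes. The helper `split_words_steps` below is hypothetical and exists only to expose the intermediate strings; it uses the same four patterns in the same order.

```python
import re

def split_words_steps(s):
    """Return the string after each of the four substitutions, in order."""
    steps = []
    # 1. upper case followed by upper case + lower case: "CWo" becomes "C Wo"
    s = re.sub(r'([A-Z])([A-Z][a-z])', r'\1 \2', s)
    steps.append(s)
    # 2. lower case followed by upper case: "dD" becomes "d D"
    s = re.sub(r'([a-z])([A-Z])', r'\1 \2', s)
    steps.append(s)
    # 3. digits followed by a non-digit: "123X" becomes "123 X"
    s = re.sub(r'([0-9]+)([^0-9])', r'\1 \2', s)
    steps.append(s)
    # 4. anything that is not alphanumeric becomes a space
    s = re.sub(r'[^a-zA-Z0-9]', ' ', s)
    steps.append(s)
    return steps

for step in split_words_steps('ABCWordDEF'):
    print(repr(step))
```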
mwag
0

Maybe this will be enough for some people:

a = "SomeCamelTextUpper"
def camelText(val):
    return ''.join([' ' + i if i.isupper() else i for i in val]).strip()
print(camelText(a))

It doesn't work with input like "CamelXYZ", but it should work just fine for the 'typical' CamelCase scenario.

-2

I think the code below is the optimum:

import re

def count_word():
    return re.findall('[A-Z]?[a-z]+', input('please enter your string'))

print(count_word())