Split a string at uppercase letters

Question

What is the pythonic way to split a string before the occurrences of a given set of characters?

For example, I want to split 'TheLongAndWindingRoad' at any occurrence of an uppercase letter (possibly except the first), and obtain ['The', 'Long', 'And', 'Winding', 'Road'].

Edit: It should also split single occurrences, i.e. from 'ABC' I'd like to obtain ['A', 'B', 'C'].

Mark Byers · Accepted Answer · 2010-02-17T00:22:52.513

195

Unfortunately, before Python 3.7 it was not possible to split on a zero-width match with re.split. But you can use re.findall instead, which works in every version:

>>> import re
>>> re.findall('[A-Z][^A-Z]*', 'TheLongAndWindingRoad')
['The', 'Long', 'And', 'Winding', 'Road']
>>> re.findall('[A-Z][^A-Z]*', 'ABC')
['A', 'B', 'C']
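
The pattern `[A-Z][^A-Z]*` only starts a match at an uppercase letter, so a lowercase run at the start of the string is skipped. Here is a minimal sketch of one way to keep it, assuming ASCII input. The function name `split_before_upper` and the extra `[^A-Z]+` alternative are illustrative and not part of the answer above:

```python
import re

def split_before_upper(s):
    # First alternative: an uppercase letter plus everything up to the next one.
    # Second alternative: a run with no uppercase letter at all; in practice it
    # can only match at the start, because the first alternative consumes the rest.
    return re.findall(r'[A-Z][^A-Z]*|[^A-Z]+', s)

print(split_before_upper('theLongAndWindingRoad'))
print(split_before_upper('ABC'))
```

For strings that already start with a capital, this should behave like the original pattern.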

edited Feb 17 '10 at 00:22

answered Feb 17 '10 at 00:04

Mark Byers


17

Beware that this will drop any characters before the first capital character. 'theLongAndWindingRoad' would result in ['Long', 'And', 'Winding', 'Road'] – Marc Schulder Jul 14 '16 at 13:44
25

@MarcSchulder: If you need that case, just use `'[a-zA-Z][^A-Z]*'` as the regex. – knub Feb 10 '17 at 14:01
Is it possible to do the same without uppercase? – Laurent Cesaro Apr 20 '18 at 09:07
4

In order to split lower camel case words `print(re.findall('^[a-z]+|[A-Z][^A-Z]*', 'theLongAndWindingRoad'))` – Ulysses May 01 '18 at 08:44
2

'ThatLeadsToYourDooooor' <3 – Ulf Gjerdingen Dec 13 '21 at 19:49
It is possible to split on a zero width match from 3.7 – Bharel Mar 15 '22 at 13:19
Is there a solution which ignores abbreviations, i.e. where all the letters are upper case, e.g. ABCDE? – Yaser Sakkaf Sep 20 '22 at 05:19

Dave Kirby · Answer 2 · 2010-02-17T08:40:41.307

47

Here is an alternative regex solution. The problem can be rephrased as "how do I insert a space before each uppercase letter, before doing the split":

>>> s = "TheLongAndWindingRoad ABC A123B45"
>>> re.sub( r"([A-Z])", r" \1", s).split()
['The', 'Long', 'And', 'Winding', 'Road', 'A', 'B', 'C', 'A123', 'B45']

This has the advantage of preserving all non-whitespace characters, which most other solutions do not.
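
Because `split()` with no argument splits on any run of whitespace, spaces already present in the input also become split points and are discarded. If they need to survive, one option is to mark the cut points with a character that is assumed never to occur in the input. This is a sketch under that assumption; `SEP` and `split_keep_spaces` are made-up names:

```python
import re

SEP = "\x00"  # assumption: NUL never appears in the strings being split

def split_keep_spaces(s):
    # Put the marker in front of every ASCII uppercase letter ...
    marked = re.sub(r"[A-Z]", lambda m: SEP + m.group(0), s)
    # ... then split on the marker only, dropping the empty first piece
    # that appears when the string starts with a capital.
    return [part for part in marked.split(SEP) if part]

print(split_keep_spaces("TheLong Road"))
```

The trade-off is that the pieces can now carry trailing spaces, which the whitespace-based version strips.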

edited Feb 17 '10 at 08:40

answered Feb 17 '10 at 08:19

Dave Kirby


Can you please explain why the space before \1 works? Is it because of the split method, or is it related to the regex? – Lax_Sam Dec 29 '18 at 10:32
split delimiter defaults to any whitespace string – CIsForCookies Jul 15 '20 at 16:42
@Lax_Sam the regex substitution just adds a space before any capital letter, and split() picks it up – vitaly Oct 23 '20 at 06:24
I am always inspired when an intractable problem transforms into a no-brainer when rephrased. – Tony Oct 04 '22 at 16:54

Endlisnis · Answer 3 · 2022-10-30T01:42:03.620

28

Use a lookahead and a lookbehind:

In Python 3.7 and later, you can do this:

re.split('(?<=.)(?=[A-Z])', 'TheLongAndWindingRoad')

And it yields:

['The', 'Long', 'And', 'Winding', 'Road']

You need the look-behind to avoid an empty string at the beginning.
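
The look-behind `(?<=.)` requires some character before the split point. Since `.` does not match a newline by default, a capital directly after a newline would not trigger a split. A sketch of a variant that instead only rules out the start of the string, assuming Python 3.7 or later (`split_camel` is an illustrative name):

```python
import re

def split_camel(s):
    # (?<!^) refuses to split at the very start of the string;
    # (?=[A-Z]) requires an ASCII uppercase letter right after the split point.
    # Splitting on an empty match needs Python 3.7 or later.
    return re.split(r'(?<!^)(?=[A-Z])', s)

print(split_camel('TheLongAndWindingRoad'))
print(split_camel('ABC'))
```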

edited Oct 30 '22 at 01:42

answered Apr 19 '19 at 19:25

Endlisnis


1

It will yield an empty string. re.split('(?=[A-Z])', 'ABC') gives ['', 'A', 'B', 'C'] – Ben Oct 28 '22 at 23:34
@Ben: Yes, you're right. I've updated my answer to avoid that. – Endlisnis Oct 30 '22 at 01:43

John La Rooy · Answer 4 · 2010-02-17T00:31:10.937

21

>>> import re
>>> re.findall('[A-Z][a-z]*', 'TheLongAndWindingRoad')
['The', 'Long', 'And', 'Winding', 'Road']

>>> re.findall('[A-Z][a-z]*', 'SplitAString')
['Split', 'A', 'String']

>>> re.findall('[A-Z][a-z]*', 'ABC')
['A', 'B', 'C']

If you want "It'sATest" to split to ["It's", 'A', 'Test'], change the regex to "[A-Z][a-z']*"
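
Note that `[a-z]*` only continues a word through lowercase ASCII letters, so digits and punctuation after a capital are dropped, whereas `[^A-Z]*` keeps everything up to the next capital. A small comparison (the variable names are just for illustration):

```python
import re

s = 'A123B45'

# Digits are not in [a-z], so each match stops right after the capital.
only_lower = re.findall('[A-Z][a-z]*', s)

# Digits are in [^A-Z], so they stay attached to the preceding capital.
until_upper = re.findall('[A-Z][^A-Z]*', s)

print(only_lower)
print(until_upper)
```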

edited Feb 17 '10 at 00:31

answered Feb 17 '10 at 00:14

John La Rooy


+1: For first to get ABC working. I've also updated my answer now. – Mark Byers Feb 17 '10 at 00:19
>>> re.findall('[A-Z][a-z]*', "It's about 70% of the Economy") -----> ['It', 'Economy'] – ChristopheD Feb 17 '10 at 00:50
@ChristopheD. The OP doesn't say how non-alpha characters should be treated. – John La Rooy Feb 17 '10 at 01:00
1

true, but this current regex way also `drops` all regular (just plain alpha) words that do not start with an uppercase letter. I doubt that that was the intention of the OP. – ChristopheD Feb 17 '10 at 12:21

pwdyson · Answer 5 · 2010-02-17T02:07:56.300

16

A variation on @ChristopheD 's solution

s = 'TheLongAndWindingRoad'

pos = [i for i,e in enumerate(s+'A') if e.isupper()]
# range() here; the original was written for Python 2 and used xrange()
parts = [s[pos[j]:pos[j+1]] for j in range(len(pos)-1)]

print(parts)
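
The same index-based idea can be wrapped in a function that also keeps a lowercase prefix, by always treating position 0 as a boundary. This is only a sketch; `split_at_upper` and `bounds` are names chosen here, not taken from the answer:

```python
def split_at_upper(s):
    # Cut before every uppercase character except one at position 0.
    cuts = [i for i, ch in enumerate(s) if ch.isupper() and i > 0]
    # Add the start and end of the string as outer boundaries.
    bounds = [0] + cuts + [len(s)]
    # Pair each boundary with the next one and slice.
    return [s[a:b] for a, b in zip(bounds, bounds[1:])]

print(split_at_upper('theLongAndWindingRoad'))
```

Because it relies on `str.isupper()` rather than `[A-Z]`, it is not limited to ASCII capitals.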

edited Feb 17 '10 at 02:07

answered Feb 17 '10 at 02:01

pwdyson


2

Nice one - this works with non-Latin characters too. The regex solutions shown here do not. – AlexVhr Feb 03 '13 at 07:43
this also returns a list which was what I needed! – JMVDA Sep 30 '21 at 17:56
Note that it should be `range` instead of `xrange`. – raspiduino May 01 '23 at 13:04

score 13 · Answer 6 · answered Nov 25 '19 at 15:19

I think that a better answer might be to split the string up into words that do not end in a capital. This would handle the case where the string doesn't start with a capital letter.

 re.findall('.[^A-Z]*', 'aboutTheLongAndWindingRoad')

example:

>>> import re
>>> re.findall('.[^A-Z]*', 'aboutTheLongAndWindingRoadABC')
['about', 'The', 'Long', 'And', 'Winding', 'Road', 'A', 'B', 'C']

Gabe · Answer 7 · 2013-06-30T04:15:55.700

7

import re
filter(None, re.split("([A-Z][^A-Z]*)", "TheLongAndWindingRoad"))

or

[s for s in re.split("([A-Z][^A-Z]*)", "TheLongAndWindingRoad") if s]
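
To see why the empty strings need filtering: with a capture group, `re.split` returns the captured matches interleaved with the text between them, and here that text is empty. A short illustration (`raw` and `with_prefix` are just illustrative names):

```python
import re

pattern = "([A-Z][^A-Z]*)"

# Matches are kept because of the capture group; the gaps between them are ''.
raw = re.split(pattern, "TheLongAndWindingRoad")
print(raw)

# A lowercase prefix is not matched, so it survives as one of the "gaps".
with_prefix = [s for s in re.split(pattern, "theLongRoad") if s]
print(with_prefix)
```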

edited Jun 30 '13 at 04:15

answered Feb 17 '10 at 00:07

Gabe


1

The filter is totally unnecessary and buys you nothing over a direct regex split with capture group: `[s for s in re.compile(r"([A-Z][^A-Z]*)").split( "TheLongAndWindingRoad") if s]` giving `['The', 'Long', 'And', 'Winding', 'Road']` – smci Jun 29 '13 at 22:15
1

@smci: This usage of `filter` is the same as the list comprehension with a condition. Do you have anything against it? – Gabe Jun 30 '13 at 04:18
1

I know it can be replaced with a list comprehension with a condition, because I just posted that code, then you copied it. Here are three reasons the list comprehension is preferable: a) *Legible idiom:* list comprehensions are a more Pythonic idiom and read clearer left-to-right than `filter(lambdaconditionfunc, ...)` b) in Python 3, `filter()` returns an iterator. So they will not be totally equivalent. c) I expect `filter()` is slower too – smci Jul 01 '13 at 08:17

user12114088 · Answer 8 · 2019-09-24T15:38:38.007

7

A Pythonic way could be:

"".join([(" "+i if i.isupper() else i) for i in 'TheLongAndWindingRoad']).strip().split()
['The', 'Long', 'And', 'Winding', 'Road']

Works well for Unicode, without needing re/re2.

"".join([(" "+i if i.isupper() else i) for i in 'СуперМаркетыПродажаКлиент']).strip().split()
['Супер', 'Маркеты', 'Продажа', 'Клиент']
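
For comparison, a regex built on the ASCII class `[A-Z]` finds nothing in the same Cyrillic string, while the `str.isupper()` test does. A small sketch (variable names are illustrative):

```python
import re

s = 'СуперМаркетыПродажаКлиент'

# [A-Z] only covers ASCII capitals, so no match can even start.
ascii_regex = re.findall('[A-Z][^A-Z]*', s)

# str.isupper() consults Unicode character properties instead.
unicode_aware = ''.join(' ' + c if c.isupper() else c for c in s).split()

print(ascii_regex)
print(unicode_aware)
```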

edited Sep 24 '19 at 15:38

answered Sep 24 '19 at 15:33

user12114088


Great way to do it without regex – callmeanythingyouwant Jun 22 '21 at 08:10
Almost feels like it violates some of the python zen, though – rearThing Aug 07 '21 at 19:49

user3726655 · Answer 9 · 2014-07-08T12:31:46.073

5

src = 'TheLongAndWindingRoad'
glue = ' '

result = ''.join(glue + x if x.isupper() else x for x in src).strip(glue).split(glue)

edited Jul 08 '14 at 12:31

answered Jul 07 '14 at 11:04

user3726655


1

Could you please add an explanation of why this is a good solution to the problem? – Matas Vaitkevicius Jul 07 '14 at 11:22
I'm sorry. I forgot the last step. – user3726655 Jul 08 '14 at 12:34
Seems concise, pythonic and self-explanatory, to me. – Dec 10 '18 at 10:44

Totoro · Answer 10 · 2017-06-02T16:32:19.950

Another approach without regex, with the option to keep contiguous uppercase characters together if wanted

def split_on_uppercase(s, keep_contiguous=False):
    """Split a string before each uppercase character.

    Args:
        s (str): string
        keep_contiguous (bool): flag to indicate we want to
                                keep contiguous uppercase chars together

    Returns:
        list[str]: the pieces of s, in their original order

    """

    string_length = len(s)
    is_lower_around = (lambda: s[i-1].islower() or 
                       string_length > (i + 1) and s[i + 1].islower())

    start = 0
    parts = []
    for i in range(1, string_length):
        if s[i].isupper() and (not keep_contiguous or is_lower_around()):
            parts.append(s[start: i])
            start = i
    parts.append(s[start:])

    return parts

>>> split_on_uppercase('theLongWindingRoad')
['the', 'Long', 'Winding', 'Road']
>>> split_on_uppercase('TheLongWindingRoad')
['The', 'Long', 'Winding', 'Road']
>>> split_on_uppercase('TheLongWINDINGRoadT', True)
['The', 'Long', 'WINDING', 'Road', 'T']
>>> split_on_uppercase('ABC')
['A', 'B', 'C']
>>> split_on_uppercase('ABCD', True)
['ABCD']
>>> split_on_uppercase('')
['']
>>> split_on_uppercase('hello world')
['hello world']
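
For readers who do want a regex, the keep_contiguous=True behaviour shown above can be approximated with a lookaround-based split. This is a sketch only: it assumes ASCII letters and Python 3.7 or later, is not claimed to match the function on every input, and `split_keep_acronyms` is an invented name:

```python
import re

def split_keep_acronyms(s):
    # First alternative: split between a lowercase and an uppercase letter.
    # Second alternative: split inside an uppercase run, just before the
    # capital that is followed by a lowercase letter (the start of a new word).
    return re.split(r'(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])', s)

print(split_keep_acronyms('TheLongWINDINGRoadT'))
print(split_keep_acronyms('ABCD'))
```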

score 3 · Answer 11 · answered Feb 17 '10 at 00:37

Alternative solution (if you dislike explicit regexes):

s = 'TheLongAndWindingRoad'

pos = [i for i,e in enumerate(s) if e.isupper()]

parts = []
for j in range(len(pos)):  # Python 3: range() replaces xrange()
    try:
        parts.append(s[pos[j]:pos[j+1]])
    except IndexError:
        parts.append(s[pos[j]:])

print(parts)

Samuel Nde · Answer 12 · 2019-04-04T17:52:37.270

Replace every uppercase letter 'L' in the given string with a space plus that letter, " L". We can do this using a list comprehension, or we can define a function to do it as follows.

s = 'TheLongANDWindingRoad ABC A123B45'
''.join([char if (char.islower() or not char.isalpha()) else ' '+char for char in list(s)]).strip().split()
>>> ['The', 'Long', 'A', 'N', 'D', 'Winding', 'Road', 'A', 'B', 'C', 'A123', 'B45']

If you choose to go by a function, here is how.

def splitAtUpperCase(text):
    result = ""
    for char in text:
        if char.isupper():
            result += " " + char
        else:
            result += char
    return result.split()

In the case of the given example:

print(splitAtUpperCase('TheLongANDWindingRoad'))
>>>['The', 'Long', 'A', 'N', 'D', 'Winding', 'Road']

But most of the time that we are splitting a sentence at upper case letters, it is usually the case that we want to maintain abbreviations that are typically a continuous stream of uppercase letters. The code below would help.

def splitAtUpperCase(s):
    for i in range(len(s)-1)[::-1]:
        if s[i].isupper() and s[i+1].islower():
            s = s[:i]+' '+s[i:]
        if s[i].isupper() and s[i-1].islower():
            s = s[:i]+' '+s[i:]
    return s.split()

splitAtUpperCase('TheLongANDWindingRoad')

>>> ['The', 'Long', 'AND', 'Winding', 'Road']

Thanks.

@MarkByers I do not know why someone down voted my answer but I would love you to take a look at it for me. I would appreciate your feedback. — Samuel Nde, Apr 04 '19 at 17:55

score 1 · Answer 13 · answered Dec 07 '14 at 06:48

An alternative way without using regex or enumerate:

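word = 'TheLongAndWindingRoad'
chars = [x for x in word]  # avoid shadowing the built-in name `list`

# Walk by position (skipping index 0) so that a later capital equal to the
# first character is still split off.
for i in range(1, len(chars)):
    if chars[i].isupper():
        chars[i] = ' ' + chars[i]

fin_list = ''.join(chars).split(' ')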

I think it is clearer and simpler without chaining too many methods or using a long list comprehension that can be difficult to read.

pylang · Answer 14 · 2018-12-31T17:29:04.853

This is possible with the more_itertools.split_before tool.

import more_itertools as mit


iterable = "TheLongAndWindingRoad"
[ "".join(i) for i in mit.split_before(iterable, pred=lambda s: s.isupper())]
# ['The', 'Long', 'And', 'Winding', 'Road']

It should also split single occurrences, i.e. from 'ABC' I'd like to obtain ['A', 'B', 'C'].

iterable = "ABC"
[ "".join(i) for i in mit.split_before(iterable, pred=lambda s: s.isupper())]
# ['A', 'B', 'C']

more_itertools is a third-party package with 60+ useful tools including implementations for all of the original itertools recipes, which obviates their manual implementation.
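
If installing a third-party package is not an option, the behaviour used above can be imitated with a small generator. This is a simplified stand-in written for this example, not the library's implementation, and `split_before_sketch` is a made-up name:

```python
def split_before_sketch(iterable, pred):
    # Collect items into a buffer and start a new buffer just before
    # every item for which pred(item) is true.
    buf = []
    for item in iterable:
        if pred(item) and buf:
            yield buf
            buf = []
        buf.append(item)
    if buf:
        yield buf

words = ["".join(g) for g in split_before_sketch("TheLongAndWindingRoad", str.isupper)]
print(words)
```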

score 0 · Answer 15 · answered Feb 10 '16 at 12:50

An alternate way using enumerate and isupper()

Code:

strs = 'TheLongAndWindingRoad'
ind = 0  # start index of the current word
new_lst = []
# Start at position 1 so the first character never triggers a split.
for index, val in enumerate(strs[1:], 1):
    if val.isupper():
        new_lst.append(strs[ind:index])
        ind = index
if ind < len(strs):
    new_lst.append(strs[ind:])
print(new_lst)

Output:

['The', 'Long', 'And', 'Winding', 'Road']

score 0 · Answer 16 · answered Mar 26 '19 at 20:06

Sharing what came to mind when I read the post. Different from other posts.

strs = 'TheLongAndWindingRoad'

# grab index of uppercase letters in strs
start_idx = [i for i,j in enumerate(strs) if j.isupper()]

# create empty list
strs_list = []

# initiate counter
cnt = 1

for pos in start_idx:
    start_pos = pos

    # use counter to grab the next start position; the last piece has none,
    # so fall back to the end of the string instead of dropping it
    try:
        end_pos = start_idx[cnt]
    except IndexError:
        end_pos = len(strs)

    # append to empty list
    strs_list.append(strs[start_pos:end_pos])

    cnt += 1

score 0 · Answer 17 · answered Dec 30 '20 at 12:49

You might also wanna do it this way

def camelcase(s):
    
    words = []
    
    for char in s:
        if char.isupper():
            words.append(':'+char)
        else:
            words.append(char)
    words = ((''.join(words)).split(':'))
    
    return words

This will output as follows

s = 'oneTwoThree'
print(camelcase(s))
# ['one', 'Two', 'Three']

score 0 · Answer 18 · edited May 15 '21 at 08:25

0

def solution(s):
   
    st = ''
    for c in s:
        if c == c.upper():
            st += ' '   
        st += c    
       
    return st

edited May 15 '21 at 08:25

12944qwerty


answered May 15 '21 at 04:22

George Sousa

1

2

This will not split into lists like the question asks for. – 12944qwerty May 15 '21 at 04:51

score 0 · Answer 19 · answered Feb 27 '22 at 15:10

0

I'm using list

def split_by_upper(x):
    i = 0
    lis = list(x)
    while True:
        if i == len(lis)-1:
            if lis[i].isupper():
                lis.insert(i,",")
            break
        if lis[i].isupper() and i != 0:
            lis.insert(i,",")
            i+=1
        i+=1
    return "".join(lis).split(",")

OUTPUT:

data = "TheLongAndWindingRoad"
print(split_by_upper(data))
>> ['The', 'Long', 'And', 'Winding', 'Road']

answered Feb 27 '22 at 15:10

Jon Snow

1

very bad code. very very long for no reason. It's like coding in the early 2000 in C – LazerDance Jul 09 '22 at 22:11

allMeow · Answer 20 · 2022-11-18T09:55:13.170

0

My solution for splitting on capitalized letters - keeps capitalized words

import re

text = 'theLongAndWindingRoad ABC'
result = re.sub('(?<=.)(?=[A-Z][a-z])', r" ", text).split()
print(result)
#['the', 'Long', 'And', 'Winding', 'Road', 'ABC']

edited Nov 18 '22 at 09:55

answered Nov 14 '22 at 08:59

allMeow


This doesn't actually answer the question, the desired result was a list of strings, not a string with spaces inserted. – cafce25 Nov 16 '22 at 20:20

score 0 · Answer 21 · answered Nov 22 '22 at 08:04

A little late to the party, but:

In [1]: camel = "CamelCaseConfig"
In [2]: parts = "".join([
    f"|{c}" if c.isupper() else c
    for c in camel
]).lstrip("|").split("|")
In [3]: screaming_snake = "_".join([
    part.upper()
    for part in parts
])
In [4]: screaming_snake
Out[4]: 'CAMEL_CASE_CONFIG'
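
If only the converted name is needed and not the intermediate list, the same conversion can be sketched in one regex substitution. This assumes ASCII capitals, and `to_screaming_snake` is an illustrative name:

```python
import re

def to_screaming_snake(camel):
    # Insert "_" at every position that is followed by an uppercase letter,
    # except the start of the string, then uppercase the whole result.
    return re.sub(r'(?<!^)(?=[A-Z])', '_', camel).upper()

print(to_screaming_snake("CamelCaseConfig"))
```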

Part of my answer is based on other people's answers here.

score 0 · Answer 22 · edited Jan 11 '23 at 20:09

0


def split_string_after_upper_case(word):

    word_lst = [x for x in word]
    index = 0
    for char in word[1:]:
        index += 1
        if char.isupper():
            word_lst.insert(index, ' ')
            index += 1
    return ''.join(word_lst).split(" ")

k = split_string_after_upper_case('TheLongAndWindingRoad')
print(k)

edited Jan 11 '23 at 20:09

Julia Meshcheryakova


answered Jan 09 '23 at 12:09

CristianG

