140

What is the pythonic way to split a string before the occurrences of a given set of characters?

For example, I want to split 'TheLongAndWindingRoad' at any occurrence of an uppercase letter (possibly except the first), and obtain ['The', 'Long', 'And', 'Winding', 'Road'].

Edit: It should also split single occurrences, i.e. from 'ABC' I'd like to obtain ['A', 'B', 'C'].

Federico A. Ramponi

22 Answers

195

Before Python 3.7, re.split could not split on a zero-width match. On any version, you can use re.findall instead:

>>> import re
>>> re.findall('[A-Z][^A-Z]*', 'TheLongAndWindingRoad')
['The', 'Long', 'And', 'Winding', 'Road']
>>> re.findall('[A-Z][^A-Z]*', 'ABC')
['A', 'B', 'C']
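
One caveat with this pattern is worth seeing in code. The sketch below wraps the same `re.findall` call in a helper (the name `split_before_uppercase` is only an illustrative choice) and shows that any text before the first ASCII uppercase letter is simply not matched:

```python
import re

def split_before_uppercase(text):
    # One ASCII uppercase letter, then everything up to the next one.
    return re.findall(r'[A-Z][^A-Z]*', text)

print(split_before_uppercase('TheLongAndWindingRoad'))
# Caveat: a lowercase prefix such as 'the' is dropped, not returned.
print(split_before_uppercase('theLongRoad'))
```

If the input may start with a lowercase letter, a pattern that begins with `.` instead of `[A-Z]` avoids the loss.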
Mark Byers
47

Here is an alternative regex solution. The problem can be rephrased as "how do I insert a space before each uppercase letter, before doing the split":

>>> import re
>>> s = "TheLongAndWindingRoad ABC A123B45"
>>> re.sub( r"([A-Z])", r" \1", s).split()
['The', 'Long', 'And', 'Winding', 'Road', 'A', 'B', 'C', 'A123', 'B45']

This has the advantage of preserving all non-whitespace characters, which most other solutions do not.

Dave Kirby
28

Use a lookahead and a lookbehind. In Python 3.7 and later, re.split can split on zero-width matches, so you can do this:

import re
re.split('(?<=.)(?=[A-Z])', 'TheLongAndWindingRoad')

And it yields:

['The', 'Long', 'And', 'Winding', 'Road']

You need the look-behind to avoid an empty string at the beginning.
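
To make the role of the look-behind concrete, here is a small comparison, assuming Python 3.7 or later (earlier versions do not split on zero-width matches). The variable names are only illustrative:

```python
import re

text = 'TheLongAndWindingRoad'

# With the look-behind: no split at position 0, so no leading ''.
with_lookbehind = re.split(r'(?<=.)(?=[A-Z])', text)

# Without it: the zero-width match before 'T' produces a leading ''.
without_lookbehind = re.split(r'(?=[A-Z])', text)

# Single uppercase letters are split apart as well.
single = re.split(r'(?<=.)(?=[A-Z])', 'ABC')

print(with_lookbehind)
print(without_lookbehind)
print(single)
```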

Endlisnis
21
>>> import re
>>> re.findall('[A-Z][a-z]*', 'TheLongAndWindingRoad')
['The', 'Long', 'And', 'Winding', 'Road']

>>> re.findall('[A-Z][a-z]*', 'SplitAString')
['Split', 'A', 'String']

>>> re.findall('[A-Z][a-z]*', 'ABC')
['A', 'B', 'C']

If you want "It'sATest" to split to ["It's", 'A', 'Test'], change the regex to "[A-Z][a-z']*"

John La Rooy
16

A variation on @ChristopheD 's solution

s = 'TheLongAndWindingRoad'

pos = [i for i,e in enumerate(s+'A') if e.isupper()]
parts = [s[pos[j]:pos[j+1]] for j in range(len(pos)-1)]

print(parts)
pwdyson
13

I think a better answer is to split the string into chunks that each start with any character and run up to, but not including, the next capital letter. This handles the case where the string doesn't start with a capital letter.

 re.findall('.[^A-Z]*', 'aboutTheLongAndWindingRoad')

example:

>>> import re
>>> re.findall('.[^A-Z]*', 'aboutTheLongAndWindingRoadABC')
['about', 'The', 'Long', 'And', 'Winding', 'Road', 'A', 'B', 'C']
shrewmouse
7
import re
list(filter(None, re.split("([A-Z][^A-Z]*)", "TheLongAndWindingRoad")))

or

[s for s in re.split("([A-Z][^A-Z]*)", "TheLongAndWindingRoad") if s]
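
A short sketch of why the empty strings need filtering at all, and of how `filter` behaves in Python 3 (where it returns a lazy iterator rather than a list). The variable names are illustrative:

```python
import re

# With a capture group, re.split returns the captured words interleaved
# with the (empty) text between them.
pieces = re.split(r'([A-Z][^A-Z]*)', 'TheLongAndWindingRoad')
print(pieces)

lazy = filter(None, pieces)        # an iterator in Python 3, not a list
as_list = list(lazy)               # materialize it to get a list
same = [s for s in pieces if s]    # equivalent list comprehension
print(as_list == same)
```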
Gabe
  • 1
    The filter is totally unnecessary and buys you nothing over a direct regex split with capture group: `[s for s in re.compile(r"([A-Z][^A-Z]*)").split( "TheLongAndWindingRoad") if s]` giving `['The', 'Long', 'And', 'Winding', 'Road']` – smci Jun 29 '13 at 22:15
  • 1
    @smci: This usage of `filter` is the same as the list comprehension with a condition. Do you have anything against it? – Gabe Jun 30 '13 at 04:18
  • 1
    I know it can be replaced with a list comprehension with a condition, because I just posted that code, then you copied it. Here are three reasons the list comprehension is preferable: a) *Legible idiom:* list comprehensions are a more Pythonic idiom and read clearer left-to-right than `filter(lambdaconditionfunc, ...)` b) in Python 3, `filter()` returns an iterator. So they will not be totally equivalent. c) I expect `filter()` is slower too – smci Jul 01 '13 at 08:17
7

A Pythonic way could be:

"".join([(" "+i if i.isupper() else i) for i in 'TheLongAndWindingRoad']).strip().split()
['The', 'Long', 'And', 'Winding', 'Road']

This works well with Unicode and avoids re/re2.

"".join([(" "+i if i.isupper() else i) for i in 'СуперМаркетыПродажаКлиент']).strip().split()
['Супер', 'Маркеты', 'Продажа', 'Клиент']
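
To illustrate the Unicode point, here is a minimal comparison between an ASCII-only character class and `str.isupper()`. The variable names are illustrative:

```python
import re

text = 'СуперМаркетыПродажаКлиент'

# [A-Z] covers only ASCII uppercase letters, so nothing matches here.
ascii_only = re.findall(r'[A-Z][^A-Z]*', text)

# str.isupper() follows Unicode case rules, so Cyrillic capitals count.
unicode_aware = "".join(" " + ch if ch.isupper() else ch for ch in text).split()

print(ascii_only)
print(unicode_aware)
```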
user12114088
5
src = 'TheLongAndWindingRoad'
glue = ' '

result = ''.join(glue + x if x.isupper() else x for x in src).strip(glue).split(glue)
user3726655
5

Another approach without regex, with the option to keep contiguous uppercase characters together:

def split_on_uppercase(s, keep_contiguous=False):
    """Split a string before each uppercase character.

    Args:
        s (str): string
        keep_contiguous (bool): flag to indicate we want to
                                keep contiguous uppercase chars together

    Returns:
        list of str: the pieces of s, in order
    """

    string_length = len(s)
    is_lower_around = (lambda: s[i-1].islower() or 
                       string_length > (i + 1) and s[i + 1].islower())

    start = 0
    parts = []
    for i in range(1, string_length):
        if s[i].isupper() and (not keep_contiguous or is_lower_around()):
            parts.append(s[start: i])
            start = i
    parts.append(s[start:])

    return parts

>>> split_on_uppercase('theLongWindingRoad')
['the', 'Long', 'Winding', 'Road']
>>> split_on_uppercase('TheLongWindingRoad')
['The', 'Long', 'Winding', 'Road']
>>> split_on_uppercase('TheLongWINDINGRoadT', True)
['The', 'Long', 'WINDING', 'Road', 'T']
>>> split_on_uppercase('ABC')
['A', 'B', 'C']
>>> split_on_uppercase('ABCD', True)
['ABCD']
>>> split_on_uppercase('')
['']
>>> split_on_uppercase('hello world')
['hello world']
Totoro
3

Alternative solution (if you dislike explicit regexes):

s = 'TheLongAndWindingRoad'

pos = [i for i,e in enumerate(s) if e.isupper()]

parts = []
for j in range(len(pos)):
    try:
        parts.append(s[pos[j]:pos[j+1]])
    except IndexError:
        # The last uppercase position has no successor: slice to the end.
        parts.append(s[pos[j]:])

print(parts)
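
The same index-based idea can be sketched without the try/except by pairing each start position with the next one using `zip`. This is only one possible variation, and the names `starts` and `ends` are illustrative:

```python
s = 'TheLongAndWindingRoad'

# Index of every uppercase letter: each one starts a new part.
starts = [i for i, ch in enumerate(s) if ch.isupper()]

# Each part ends where the next one starts; the last ends at len(s).
ends = starts[1:] + [len(s)]

parts = [s[a:b] for a, b in zip(starts, ends)]
print(parts)
```

As with the loop version, any text before the first uppercase letter is not included in the result.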
ChristopheD
3

Replace every uppercase letter, say 'L', in the given string with a space plus that letter, " L", then split on whitespace. We can do this with a list comprehension, or we can define a function to do it, as follows.

s = 'TheLongANDWindingRoad ABC A123B45'
''.join([char if (char.islower() or not char.isalpha()) else ' '+char for char in list(s)]).strip().split()
>>> ['The', 'Long', 'A', 'N', 'D', 'Winding', 'Road', 'A', 'B', 'C', 'A123', 'B45']

If you prefer a function, here is how:

def splitAtUpperCase(text):
    result = ""
    for char in text:
        if char.isupper():
            result += " " + char
        else:
            result += char
    return result.split()

For example, with an input that contains a run of uppercase letters:

print(splitAtUpperCase('TheLongANDWindingRoad'))
>>>['The', 'Long', 'A', 'N', 'D', 'Winding', 'Road']

But most of the time, when we split a sentence at uppercase letters, we want to keep abbreviations intact, and those are typically a continuous run of uppercase letters. The code below does that.

def splitAtUpperCase(s):
    for i in range(len(s)-1)[::-1]:
        if s[i].isupper() and s[i+1].islower():
            s = s[:i]+' '+s[i:]
        if s[i].isupper() and s[i-1].islower():
            s = s[:i]+' '+s[i:]
    return s.split()

splitAtUpperCase('TheLongANDWindingRoad')

>>> ['The', 'Long', 'AND', 'Winding', 'Road']
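
For comparison, the same "keep abbreviations together" behaviour can be sketched with a regex instead of the loop. This is a different technique from the function above, not a rewrite of it, and the helper name `split_keep_acronyms` is hypothetical. It assumes ASCII letters only and, like other `findall` approaches, drops digits and any lowercase prefix:

```python
import re

def split_keep_acronyms(text):
    # First alternative: a run of uppercase letters that is not followed
    # by a lowercase letter (an acronym such as 'AND').
    # Second alternative: an ordinary capitalized word.
    return re.findall(r'[A-Z]+(?![a-z])|[A-Z][a-z]*', text)

print(split_keep_acronyms('TheLongANDWindingRoad'))
print(split_keep_acronyms('HTTPServer'))
```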

Thanks.

Samuel Nde
  • @MarkByers I do not know why someone down voted my answer but I would love you to take a look at it for me. I would appreciate your feedback. – Samuel Nde Apr 04 '19 at 17:55
1

An alternative way without using regex or enumerate:

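word = 'TheLongAndWindingRoad'
chars = list(word)

# Start at index 1 so the first character never gets a leading space.
# Working by position also handles repeated letters correctly.
for i in range(1, len(chars)):
    if chars[i].isupper():
        chars[i] = ' ' + chars[i]

fin_list = ''.join(chars).split(' ')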

I think it is clearer and simpler without chaining too many methods or using a long list comprehension that can be difficult to read.

Pandemonium
1

This is possible with the more_itertools.split_before tool.

import more_itertools as mit


iterable = "TheLongAndWindingRoad"
[ "".join(i) for i in mit.split_before(iterable, pred=lambda s: s.isupper())]
# ['The', 'Long', 'And', 'Winding', 'Road']

It should also split single occurrences, i.e. from 'ABC' I'd like to obtain ['A', 'B', 'C'].

iterable = "ABC"
[ "".join(i) for i in mit.split_before(iterable, pred=lambda s: s.isupper())]
# ['A', 'B', 'C']

more_itertools is a third-party package with 60+ useful tools including implementations for all of the original itertools recipes, which obviates their manual implementation.
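
If installing a third-party package is not an option, the behaviour used above can be approximated with a small generator. This is a sketch under that assumption, not the library's actual implementation, and `split_before_sketch` is a hypothetical name:

```python
def split_before_sketch(iterable, pred):
    # Collect items into a chunk; start a new chunk just before every
    # item for which pred(item) is true.
    chunk = []
    for item in iterable:
        if pred(item) and chunk:
            yield chunk
            chunk = []
        chunk.append(item)
    if chunk:
        yield chunk

words = ["".join(c) for c in split_before_sketch("TheLongAndWindingRoad", str.isupper)]
print(words)
```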

pylang
0

An alternate way using enumerate and isupper()

Code:

strs = 'TheLongAndWindingRoad'
ind =0
count =0
new_lst=[]
for index, val in enumerate(strs[1:],1):
    if val.isupper():
        new_lst.append(strs[ind:index])
        ind=index
if ind<len(strs):
    new_lst.append(strs[ind:])
print(new_lst)

Output:

['The', 'Long', 'And', 'Winding', 'Road']
The6thSense
0

Sharing what came to mind when I read the post. Different from other posts.

strs = 'TheLongAndWindingRoad'

# grab index of uppercase letters in strs
start_idx = [i for i,j in enumerate(strs) if j.isupper()]

# create empty list
strs_list = []

# initiate counter
cnt = 1

for pos in start_idx:
    start_pos = pos

    # use counter to grab the next start index; the last uppercase letter
    # has none, so slice to the end of the string instead of skipping it
    try:
        end_pos = start_idx[cnt]
    except IndexError:
        end_pos = len(strs)

    # append to empty list
    strs_list.append(strs[start_pos:end_pos])

    cnt += 1
Do L.
0

You might also wanna do it this way

def camelcase(s):
    
    words = []
    
    for char in s:
        if char.isupper():
            words.append(':'+char)
        else:
            words.append(char)
    words = ''.join(words).split(':')

    return words

This will output as follows

s = 'oneTwoThree'
print(camelcase(s))
# ['one', 'Two', 'Three']
Muteshi
0
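def solution(s):

    st = ''
    for c in s:
        # isupper() is False for digits, spaces and punctuation,
        # unlike the comparison c == c.upper()
        if c.isupper():
            st += ' '
        st += c

    # split on the inserted spaces to get a list of parts
    return st.split()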
12944qwerty
0

I'm using list

def split_by_upper(x):
    i = 0
    lis = list(x)
    while True:
        if i == len(lis)-1:
            if lis[i].isupper():
                lis.insert(i,",")
            break
        if lis[i].isupper() and i != 0:
            lis.insert(i,",")
            i+=1
        i+=1
    return "".join(lis).split(",")

OUTPUT:

data = "TheLongAndWindingRoad"
print(split_by_upper(data))
>> ['The', 'Long', 'And', 'Winding', 'Road']
0

My solution for splitting on capitalized letters - keeps capitalized words

import re

text = 'theLongAndWindingRoad ABC'
result = re.sub('(?<=.)(?=[A-Z][a-z])', r" ", text).split()
print(result)
#['the', 'Long', 'And', 'Winding', 'Road', 'ABC']
allMeow
  • This doesn't actually answer the question, the desired result was a list of strings, not a string with spaces inserted. – cafce25 Nov 16 '22 at 20:20
0

A little late to the party, but:

In [1]: camel = "CamelCaseConfig"
In [2]: parts = "".join([
    f"|{c}" if c.isupper() else c
    for c in camel
]).lstrip("|").split("|")
In [3]: screaming_snake = "_".join([
    part.upper()
    for part in parts
])
In [4]: screaming_snake
Out[4]: 'CAMEL_CASE_CONFIG'

Part of my answer is based on other answers here.
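
The camel-case to screaming-snake conversion above can also be sketched as a single function built on `re.sub`. The function name `camel_to_screaming_snake` is a hypothetical choice, and the pattern assumes ASCII uppercase letters:

```python
import re

def camel_to_screaming_snake(name):
    # Insert '_' before every uppercase letter except one at the very
    # start of the string, then upper-case the whole result.
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).upper()

print(camel_to_screaming_snake("CamelCaseConfig"))
```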

rodfersou
0


def split_string_after_upper_case(word):

    word_lst = [x for x in word]
    index = 0
    for char in word[1:]:
        index += 1
        if char.isupper():
            word_lst.insert(index, ' ')
            index += 1
    return ''.join(word_lst).split(" ")

k = split_string_after_upper_case('TheLongAndWindingRoad')
print(k)
Julia Meshcheryakova
CristianG