In Python, how do I split a string and keep the separators?

Question

Here's the simplest way to explain this. Here's what I'm using:

re.split('\W', 'foo/bar spam\neggs')
>>> ['foo', 'bar', 'spam', 'eggs']

Here's what I want:

someMethod('\W', 'foo/bar spam\neggs')
>>> ['foo', '/', 'bar', ' ', 'spam', '\n', 'eggs']

The reason is that I want to split a string into tokens, manipulate it, then put it back together again.

A _non-word_ character [see here for details](https://docs.python.org/2/library/re.html#regular-expression-syntax) — Russell, Dec 02 '15 at 21:27
For the question applied to a raw byte string and put down to "Split a string and keep the delimiters as part of the split string chunks, not as separate list elements", see https://stackoverflow.com/questions/62591863/split-a-string-and-keep-the-delimiters-as-part-of-the-split-string-chunks-not-a?noredirect=1#comment110690060_62591863 — questionto42, Jun 28 '20 at 12:20

score 452 · Accepted Answer · edited Nov 14 '22 at 13:17

The docs of re.split mention:

Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.

So you just need to wrap your separator with a capturing group:

>>> re.split('(\W)', 'foo/bar spam\neggs')
['foo', '/', 'bar', ' ', 'spam', '\n', 'eggs']
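
As a minimal sketch of the same approach (the variable names are illustrative, not from the answer), a raw string keeps `\W` from being read as a string escape, and the resulting pieces join back into the original string:

```python
import re

text = 'foo/bar spam\neggs'

# A raw string keeps the backslash in \W from being read as a string escape.
tokens = re.split(r'(\W)', text)
print(tokens)  # ['foo', '/', 'bar', ' ', 'spam', '\n', 'eggs']

# Separators are kept, so joining the pieces restores the input.
assert ''.join(tokens) == text

# With \W+ a run of separators becomes a single list element.
runs = re.split(r'(\W+)', 'foo, bar')
print(runs)  # ['foo', ', ', 'bar']
```

This round trip is what makes the capturing-group form suitable for the "tokenize, manipulate, reassemble" use case in the question.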

edited Nov 14 '22 at 13:17

Tomerikoo


answered Jan 25 '10 at 23:45

Commodore Jaeger

That's cool. I didn't know re.split did that with capture groups. – Laurence Gonsalves Jan 25 '10 at 23:48
@Laurence: Well, it's documented: http://docs.python.org/library/re.html#re.split: "Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list." – Vinay Sajip Jan 25 '10 at 23:54
It's seriously underdocumented. I've been using Python for 14 years and only just found this out. – smci Jun 19 '13 at 16:33
It's also possible to [escape the special characters in a string](http://stackoverflow.com/questions/280435/escaping-regex-string-in-python), which makes it easier to generate a regular expression that matches a list of strings. – Anderson Green Mar 04 '14 at 04:00
Is there an option so that the output of the group match is attached to whatever is on the left (or analogously right) of the split? For example, can this be easily modified so the output is `['foo', '/bar', ' spam', '\neggs']`? – ely Feb 09 '15 at 02:24
@Mr.F You might be able to do something with re.sub. I wanted to split on a ending percent so I just subbed in a double character and then split, hacky but worked for my case: `re.split('% ', re.sub('% ', '%% ', '5.000% Additional Whatnot'))` --> `['5.000%', 'Additional Whatnot']` – Kyle James Walker Oct 11 '15 at 22:41
Best make it `re.split(r'(\W)', 'foo/bar spam\neggs')` to avoid the DeprecationWarning (see, e.g., https://stackoverflow.com/questions/50504500/deprecationwarning-invalid-escape-sequence-what-to-use-instead-of-d) – beldaz Feb 01 '21 at 02:35
Great. If you want alternating tokens and separators, as you usually do, it would of course be better to use `\W+`. The algorithm behind `split` also seems to indicate whether your list begins/ends with a token or a separator: if starting with a separator it prepends an empty string as the first element. If ending with a separator it adds an empty string as the last element. Useful. – mike rodent Apr 26 '21 at 06:35
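
For the variant ely asks about in the comments, where each separator stays attached to a neighbouring chunk, one possible sketch splits on a zero-width lookahead or lookbehind. This assumes Python 3.7 or later, where `re.split` also splits on empty matches; it is not taken from any of the answers here, and the variable names are illustrative.

```python
import re

text = 'foo/bar spam\neggs'

# Split just before each non-word character: the separator leads the next chunk.
sep_first = re.split(r'(?=\W)', text)
print(sep_first)  # ['foo', '/bar', ' spam', '\neggs']

# Split just after each non-word character: the separator ends the previous chunk.
sep_last = re.split(r'(?<=\W)', text)
print(sep_last)  # ['foo/', 'bar ', 'spam\n', 'eggs']
```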

score 51 · Answer 2 · answered May 17 '16 at 19:20

If you are splitting on newline, use splitlines(True).

>>> 'line 1\nline 2\nline without newline'.splitlines(True)
['line 1\n', 'line 2\n', 'line without newline']

(Not a general solution, but adding this here in case someone comes here not realizing this method existed.)

score 34 · Answer 3 · answered May 29 '18 at 04:05

another example, split on non alpha-numeric and keep the separators

import re
a = "foo,bar@candy*ice%cream"
re.split('([^a-zA-Z0-9])',a)

output:

['foo', ',', 'bar', '@', 'candy', '*', 'ice', '%', 'cream']

explanation

re.split('([^a-zA-Z0-9])',a)

() <- capturing group: keep the separators
[] <- character class: match any single character described inside
^a-zA-Z0-9 <- anything except letters (upper/lower case) and digits.
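
A small sketch of how this character class differs from `\W` (the sample string is made up for illustration): `\W` treats the underscore as a word character, while `[^a-zA-Z0-9]` treats it as a separator.

```python
import re

sample = 'snake_case-name'

# \W does not match '_', so the underscore stays inside the token.
with_w = re.split(r'(\W)', sample)
print(with_w)  # ['snake_case', '-', 'name']

# The explicit class treats '_' as a separator too.
with_class = re.split(r'([^a-zA-Z0-9])', sample)
print(with_class)  # ['snake', '_', 'case', '-', 'name']
```

In `str` patterns `\W` also leaves non-ASCII letters inside tokens, so the two patterns are only interchangeable for plain ASCII input without underscores.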

answered May 29 '18 at 04:05

anurag


Even though, as the [docs](https://docs.python.org/3.7/library/re.html#regular-expression-syntax) say, this is equivalent to the accepted answer, I like this version's readability--even though `\W` is a more compact way to express it. – ephsmith Oct 17 '18 at 01:03
I like the readability of this as well, plus you can customize it if you want to include/exclude some chars! – tikka Dec 11 '20 at 15:04
it's also better for "crazy" languages which use punctuation as part of a word. Some Hebrew words have ' or " built in (כפר אז"ר, ג'ירף), which requires special treatment. – Berry Tsakala May 06 '21 at 15:57

score 18 · Answer 4 · answered Jul 02 '17 at 11:04

If you have only 1 separator, you can employ list comprehensions:

text = 'foo,bar,baz,qux'
sep = ','

Appending/prepending separator:

result = [x+sep for x in text.split(sep)]
#['foo,', 'bar,', 'baz,', 'qux,']
# to get rid of the trailing separator
# (slicing removes exactly one separator; strip(sep) can remove more)
result[-1] = result[-1][:-len(sep)]
#['foo,', 'bar,', 'baz,', 'qux']

result = [sep+x for x in text.split(sep)]
#[',foo', ',bar', ',baz', ',qux']
# to get rid of the leading separator
result[0] = result[0][len(sep):]
#['foo', ',bar', ',baz', ',qux']

Separator as its own element:

result = [u for x in text.split(sep) for u in (x, sep)]
#['foo', ',', 'bar', ',', 'baz', ',', 'qux', ',']
result = result[:-1]   # to get rid of the trailing separator

answered Jul 02 '17 at 11:04

Granitosaurus


you can also add in `if x` to ensure that the chunk produced by `split` has some content, i.e. `result = [x + sep for x in text.split(sep) if x]` – i alarmed alien May 08 '20 at 15:12
For me, strip removed too much and I had to use this: `result = [sep+x for x in data.split(sep)]` `result[0] = result[0][len(sep):]` – scottlittle Jun 10 '20 at 00:49

score 12 · Answer 5 · answered Dec 06 '15 at 17:35

Another no-regex solution that works well on Python 3

# Split strings and keep separator
test_strings = ['<Hello>', 'Hi', '<Hi> <Planet>', '<', '']

def split_and_keep(s, sep):
   if not s: return [''] # consistent with string.split()

   # Find replacement character that is not used in string
   # i.e. just use the highest available character plus one
   # Note: This fails if ord(max(s)) = 0x10FFFF (ValueError)
   p=chr(ord(max(s))+1) 

   return s.replace(sep, sep+p).split(p)

for s in test_strings:
   print(split_and_keep(s, '<'))


# If the unicode limit is reached it will fail explicitly (ValueError from chr)
unicode_max_char = chr(1114111)
ridiculous_string = '<Hello>'+unicode_max_char+'<World>'
try:
   print(split_and_keep(ridiculous_string, '<'))
except ValueError as e:
   print('failed as expected:', e)

score 6 · Answer 6 · answered Aug 22 '18 at 02:18

One Lazy and Simple Solution

Assume your regex pattern is split_pattern = r'(!|\?)'

First, insert a marker string after each separator, for example '[cut]':

new_string = re.sub(split_pattern, '\\1[cut]', your_string)

Then split on the marker: new_string.split('[cut]')
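
Put together, the two steps might look like this. The sample text and the `marker` variable are placeholders added for illustration, and the trick only works when the marker does not already occur in the input:

```python
import re

split_pattern = r'(!|\?)'
your_string = 'Hi! How are you? Fine.'
marker = '[cut]'

# The trick breaks if the marker already appears in the text.
assert marker not in your_string

# Append the marker after every separator, then split on the marker.
new_string = re.sub(split_pattern, r'\1' + marker, your_string)
parts = new_string.split(marker)
print(parts)  # ['Hi!', ' How are you?', ' Fine.']
```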

answered Aug 22 '18 at 02:18

Yilei Wang


This approach is clever, but will fail when the original string already contains `[cut]` somewhere. – Matthijs Kooijman Sep 19 '19 at 09:27
It could be faster on large scale problems as it finally uses string.split(), in case that re.split() costs more than re.sub() with string.split() (which I do not know). – questionto42 Jun 26 '20 at 12:43

score 5 · Answer 7 · edited Aug 27 '20 at 18:49

You can also split a string with an array of strings instead of a regular expression, like this:

def tokenizeString(aString, separators):
    #separators is an array of strings that are being used to split the string.
    #sort separators in order of ascending length, so that the longest
    #matching separator is found last in the loop below and wins
    separators.sort(key=len)
    listToReturn = []
    i = 0
    while i < len(aString):
        theSeparator = ""
        for current in separators:
            if current == aString[i:i+len(current)]:
                theSeparator = current
        if theSeparator != "":
            listToReturn += [theSeparator]
            i = i + len(theSeparator)
        else:
            if listToReturn == []:
                listToReturn = [""]
            if(listToReturn[-1] in separators):
                listToReturn += [""]
            listToReturn[-1] += aString[i]
            i += 1
    return listToReturn
    

print(tokenizeString(aString = "\"\"\"hi\"\"\" hello + world += (1*2+3/5) '''hi'''", separators = ["'''", '+=', '+', "/", "*", "\\'", '\\"', "-=", "-", " ", '"""', "(", ")"]))

orestisf · Answer 8 · 2020-09-18T06:29:07.483

Here is a simple .split solution that works without regex.

This is an answer for "Python split() without removing the delimiter", so not exactly what the original post asks, but that question was closed as a duplicate of this one.

def splitkeep(s, delimiter):
    split = s.split(delimiter)
    return [substr + delimiter for substr in split[:-1]] + [split[-1]]

Random tests:

import random

CHARS = [".", "a", "b", "c"]
assert splitkeep("", "X") == [""]  # 0 length test
for delimiter in ('.', '..'):
    for _ in range(100000):
        length = random.randint(1, 50)
        s = "".join(random.choice(CHARS) for _ in range(length))
        assert "".join(splitkeep(s, delimiter)) == s

regex should be avoided on large scale problems for speed reasons, that is why this is a good hint. — questionto42, Jun 26 '20 at 12:40

score 4 · Answer 9 · answered Apr 13 '20 at 01:44

Replace every separator (\W) with separator + new separator (\W;),
then split by the new separator (;):

import re

def split_and_keep(separator, s):
  return re.split(';', re.sub(separator, lambda match: match.group() + ';', s))

print(split_and_keep(r'\W', 'foo/bar spam\neggs'))
# ['foo/', 'bar ', 'spam\n', 'eggs']

answered Apr 13 '20 at 01:44

kobako


Yeah this is better, although switching the order of when `';'` is appended worked. – Paul Carlton Mar 14 '21 at 17:21

score 3 · Answer 10 · answered Nov 30 '15 at 17:49

# This keeps all separators in the result
import re

st = "%%(c+dd+e+f-1523)%%7"
sh = re.compile(r'[+\-/*<>%()]')

def splitStringFull(sh, st):
    ls = sh.split(st)
    lo = []
    start = 0
    for l in ls:
        if not l:
            continue
        k = st.find(l, start)  # search from the current position, not from 0
        if k > start:
            lo.append(st[start:k])  # run of separators before this token
        lo.append(l)
        start = k + len(l)
    if start < len(st):
        lo.append(st[start:])  # trailing separators, if any
    return lo

li = splitStringFull(sh, st)
print(li)
# ['%%(', 'c', '+', 'dd', '+', 'e', '+', 'f', '-', '1523', ')%%', '7']

score 2 · Answer 11 · answered Aug 26 '16 at 13:56

If one wants to split string while keeping separators by regex without capturing group:

def finditer_with_separators(regex, s):
    matches = []
    prev_end = 0
    for match in regex.finditer(s):
        match_start = match.start()
        if (prev_end != 0 or match_start > 0) and match_start != prev_end:
            matches.append(s[prev_end:match.start()])
        matches.append(match.group())
        prev_end = match.end()
    if prev_end < len(s):
        matches.append(s[prev_end:])
    return matches

import re

s = "f(x) + g(y)"  # example input, not part of the original answer
regex = re.compile(r"[\(\)]")
matches = finditer_with_separators(regex, s)

If one assumes that regex is wrapped up into capturing group:

def split_with_separators(regex, s):
    matches = list(filter(None, regex.split(s)))
    return matches

regex = re.compile(r"([\(\)])")
matches = split_with_separators(regex, s)

Both ways also remove empty strings from the result, which are useless and annoying in most cases.

This ended up working perfectly for me. Thanks for your contribution! — Nick Saccente, Jan 07 '22 at 15:40

score 2 · Answer 12 · edited Jun 12 '21 at 09:13

Install wrs ("without removing splitor") by running:

pip install wrs

(developed by Rao Hamza)

import wrs
text  = "Now inbox “how to make spam ad” Invest in hard email marketing."
splitor = 'email | spam | inbox'
list = wrs.wr_split(splitor, text)
print(list)

result: ['now ', 'inbox “how to make ', 'spam ad” invest in hard ', 'email marketing.']

edited Jun 12 '21 at 09:13

Marzi Heidari


answered Jun 11 '21 at 17:04

Rao Mohammad


score 1 · Answer 13 · answered Feb 26 '21 at 08:27

May I just leave it here

s = 'foo/bar spam\neggs'
print(s.replace('/', '+++/+++').replace(' ', '+++ +++').replace('\n', '+++\n+++').split('+++'))

['foo', '/', 'bar', ' ', 'spam', '\n', 'eggs']

answered Feb 26 '21 at 08:27

Marat Zakirov


Karthick Hari · Answer 14 · 2023-05-25T09:05:52.513

How to split a string in Python while keeping whitespace, including continuous whitespace?

def splitWithSpace(string):
    split_list = []
    new_word = ""
    for character in string:
        if character == " ":
            # flush the word collected so far, then keep the space itself
            if new_word:
                split_list.append(new_word)
            split_list.append(" ")
            new_word = ""
        else:
            new_word += character
    # flush the last word, if the string did not end with a space
    if new_word:
        split_list.append(new_word)
    print(split_list)

Single Space:

splitWithSpace("this is a simple text")

Answer: ['this', ' ', 'is', ' ', 'a', ' ', 'simple', ' ', 'text']

More Space:

splitWithSpace("this is  a  simple text")

Answer: ['this', ' ', 'is', ' ', ' ', 'a', ' ', ' ', 'simple', ' ', 'text']

score 0 · Answer 15 · answered Dec 12 '18 at 15:20

I had a similar issue trying to split a file path and struggled to find a simple answer. This worked for me and didn't involve having to substitute delimiters back into the split text:

my_path = 'folder1/folder2/folder3/file1'

import re

re.findall('[^/]+/|[^/]+', my_path)

returns:

['folder1/', 'folder2/', 'folder3/', 'file1']

answered Dec 12 '18 at 15:20

Conor


This can be slightly simplified by using: `re.findall('[^/]+/?', my_path)` (i.e. making the trailing slash optional using a `?` rather than providing two alternatives with `|`). – Matthijs Kooijman Sep 19 '19 at 09:30
for paths, you're far better off using the stdlib `os.path` functions – anon01 Aug 21 '20 at 06:31

Chen Levy · Answer 16 · 2019-11-11T15:17:36.013

I found this generator based approach more satisfying:

def split_keep(string, sep):
    """Usage:
    >>> list(split_keep("a.b.c.d", "."))
    ['a.', 'b.', 'c.', 'd']
    """
    start = 0
    while True:
        end = string.find(sep, start) + 1
        if end == 0:
            break
        yield string[start:end]
        start = end
    yield string[start:]

It avoids the need to figure out the correct regex and should in theory be fairly cheap: it builds no intermediate list and delegates most of the iteration work to the efficient find method.

... and in Python 3.8 it can be as short as:

def split_keep(string, sep):
    start = 0
    while (end := string.find(sep, start) + 1) > 0:
        yield string[start:end]
        start = end
    yield string[start:]
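
Both versions step one character past each match, which fits single-character separators. A hedged variant for longer separators (the name `split_keep_multi` is made up here, and it needs Python 3.8 for `:=`) steps by `len(sep)` instead:

```python
def split_keep_multi(string, sep):
    """Yield chunks of `string`, each ending with `sep` except possibly the last."""
    if not sep:
        # find('') always matches at `start`, so the loop would never advance
        raise ValueError("empty separator")
    start = 0
    while (idx := string.find(sep, start)) != -1:
        end = idx + len(sep)  # step over the whole separator
        yield string[start:end]
        start = end
    yield string[start:]

print(list(split_keep_multi("a--b--c", "--")))  # ['a--', 'b--', 'c']
print(list(split_keep_multi("a.b.c.d", ".")))   # ['a.', 'b.', 'c.', 'd']
```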

score 0 · Answer 17 · answered Apr 14 '21 at 04:08

If your separators come from a variable and there are several of them, build the pattern with re.escape and pass it to re.split:

import re

# BashSpecialParamList holds the bash special parameters
# that serve as separators in this example
BashSpecialParamList = ["$*", "$@", "$#", "$?", "$-", "$$", "$!", "$0"]
# aStr is the string to be split
aStr = "$a Klkjfd$0 $? $#%$*Sdfdf"

reStr = "|".join([re.escape(sepStr) for sepStr in BashSpecialParamList])

re.split(f'({reStr})', aStr)

# Then you get the result:
# ['$a Klkjfd', '$0', ' ', '$?', ' ', '$#', '%', '$*', 'Sdfdf']

reference: GNU Bash Special Parameters

score 0 · Answer 18 · answered Jun 18 '21 at 16:40

Some of the answers posted before repeat the delimiter or have other bugs that I ran into in my case. You can use this function instead:

def split_and_keep_delimiter(input, delimiter):
    result      = list()
    idx         = 0
    while delimiter in input:
        idx     = input.index(delimiter)
        result.append(input[0:idx+len(delimiter)])
        input = input[idx+len(delimiter):]
    result.append(input)
    return result

answered Jun 18 '21 at 16:40

Tayyebi


This function is incorrect - it sometimes returns an empty string at the end. Try `test_splitter('Hello World ! ',' ')` and you get `['Hello ', 'World ', ' ', '! ', ' ', ' ', '']` – Ryan Burgert Aug 31 '22 at 20:43

score 0 · Answer 19 · answered Aug 31 '22 at 21:58

In the below code, there is a simple, very efficient and well tested answer to this question. The code has comments explaining everything in it.

I promise it's not as scary as it looks - it's actually only 13 lines of code! The rest are all comments, docs and assertions

def split_including_delimiters(input: str, delimiter: str):
    """
    Splits an input string, while including the delimiters in the output
    
    Unlike str.split, we can use an empty string as a delimiter
    Unlike str.split, the output will not have any extra empty strings
    Consequently, len(split_including_delimiters('', delimiter)) == 0 for all delimiters,
       whereas len(input.split(delimiter)) > 0 for all inputs and non-empty delimiters
    
    INPUTS:
        input: Can be any string
        delimiter: Can be any string

    EXAMPLES:
         >>> split_including_delimiters('Hello World  ! ',' ')
        ans = ['Hello', ' ', 'World', ' ', ' ', '!', ' ']
         >>> split_including_delimiters("Hello**World**!***", "**")
        ans = ['Hello', '**', 'World', '**', '!', '**', '*']
    TESTS:
        assert split_including_delimiters('-xx-xx-','xx') == ['-', 'xx', '-', 'xx', '-'] # length 5
        assert split_including_delimiters('xx-xx-' ,'xx') == ['xx', '-', 'xx', '-']      # length 4
        assert split_including_delimiters('-xx-xx' ,'xx') == ['-', 'xx', '-', 'xx']      # length 4
        assert split_including_delimiters('xx-xx'  ,'xx') == ['xx', '-', 'xx']           # length 3
        assert split_including_delimiters('xxxx'   ,'xx') == ['xx', 'xx']                # length 2
        assert split_including_delimiters('xxx'    ,'xx') == ['xx', 'x']                 # length 2
        assert split_including_delimiters('x'      ,'xx') == ['x']                       # length 1
        assert split_including_delimiters(''       ,'xx') == []                          # length 0
        assert split_including_delimiters('aaa'    ,'xx') == ['aaa']                     # length 1
        assert split_including_delimiters('aa'     ,'xx') == ['aa']                      # length 1
        assert split_including_delimiters('a'      ,'xx') == ['a']                       # length 1
        assert split_including_delimiters(''       ,''  ) == []                          # length 0
        assert split_including_delimiters('a'      ,''  ) == ['a']                       # length 1
        assert split_including_delimiters('aa'     ,''  ) == ['a', '', 'a']              # length 3
        assert split_including_delimiters('aaa'    ,''  ) == ['a', '', 'a', '', 'a']     # length 5
    """

    # Input assertions
    assert isinstance(input,str), "input must be a string"
    assert isinstance(delimiter,str), "delimiter must be a string"

    if delimiter:
        # These tokens do not include the delimiter, but are computed quickly
        tokens = input.split(delimiter)
    else:
        # Edge case: if the delimiter is the empty string, split between the characters
        tokens = list(input)
        
    # The following assertions are always true for any string input and delimiter
    # For speed's sake, we disable this assertion
    # assert delimiter.join(tokens) == input

    output = tokens[:1]

    for token in tokens[1:]:
        output.append(delimiter)
        if token:
            output.append(token)
    
    # Don't let the first element be an empty string
    if output[:1]==['']:
        del output[0]
        
    # The only case where we should have an empty string in the output is if it is our delimiter
    # For speed's sake, we disable this assertion
    # assert delimiter=='' or '' not in output
        
    # The resulting strings should be combinable back into the original string
    # For speed's sake, we disable this assertion
    # assert ''.join(output) == input

    return output

score 0 · Answer 20 · answered Nov 14 '22 at 11:14

>>> line = 'hello_toto_is_there'
>>> sep = '_'
>>> [sep + x[1] if x[0] != 0 else x[1] for x in enumerate(line.split(sep))]
['hello', '_toto', '_is', '_there']

answered Nov 14 '22 at 11:14

user3277560


score 0 · Answer 21 · answered Jun 28 '23 at 07:38

An implementation that uses only lists (with the help of str.partition() and str.rpartition()):

import typing as t


def partition(s: str, seps: t.Iterable[str]):
    if not s or not seps:
        return [s]
    st1, st2 = [s], []
    for sep in set(seps):
        if st1:
            while st1:
                st2.append(st1.pop())
                while True:
                    x1, x2, x3 = st2.pop().rpartition(sep)
                    if not x2:  # `sep` not found
                        st2.append(x3)
                        break
                    if not x1:
                        st2.extend([x3, x2] if x3 else [x2])
                        break
                    st2.extend([x3, x2, x1] if x3 else [x2, x1])
        else:
            while st2:
                st1.append(st2.pop())
                while True:
                    x1, x2, x3 = st1.pop().partition(sep)
                    if not x2:  # `sep` not found
                        st1.append(x1)
                        break
                    if not x3:
                        st1.extend([x1, x2] if x1 else [x2])
                        break
                    st1.extend([x1, x2, x3] if x1 else [x2, x3])
    return st1 or list(reversed(st2))

assert partition('abcdbcd', ['a']) == ['a', 'bcdbcd']
assert partition('abcdbcd', ['b']) == ['a', 'b', 'cd', 'b', 'cd']
assert partition('abcdbcd', ['d']) == ['abc', 'd', 'bc', 'd']
assert partition('abcdbcd', ['e']) == ['abcdbcd']
assert partition('abcdbcd', ['b', 'd']) == ['a', 'b', 'c', 'd', 'b', 'c', 'd']
assert partition('abcdbcd', ['db']) == ['abc', 'db', 'cd']
