Splitting on first occurrence

Question

What would be the best way to split a string on the first occurrence of a delimiter?

For example:

"123mango abcd mango kiwi peach"

splitting on the first mango to get:

"abcd mango kiwi peach"

To split on the last occurrence instead, see "partition string in python and get value of last segment after colon".

Wouldn't there be a space in the split result, as in `" abcd ..."`? – omsrisagar Aug 18 '23 at 23:57

score 816 · Accepted Answer · edited Jun 20 '20 at 09:12


From the docs:

str.split([sep[, maxsplit]])

Return a list of the words in the string, using sep as the delimiter string. If maxsplit is given, at most maxsplit splits are done (thus, the list will have at most maxsplit+1 elements).

s.split('mango', 1)[1]
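As a minimal sketch of how this behaves on the question's example string (the names `before` and `after` are illustrative, not from the answer): the delimiter itself is dropped, but the space that followed it stays, so an `lstrip()` may be wanted.

```python
s = "123mango abcd mango kiwi peach"

# maxsplit=1 stops after the first delimiter, so the list has at most 2 items.
before, after = s.split("mango", 1)

print(repr(before))          # '123'
print(repr(after))           # ' abcd mango kiwi peach'
print(repr(after.lstrip()))  # 'abcd mango kiwi peach'
```

Note that the two-name unpacking only works when the delimiter is present; otherwise the list has a single element and the unpacking raises `ValueError`.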


answered Aug 01 '11 at 19:48

Ignacio Vazquez-Abrams


Note: if more splits can be performed after reaching the `maxsplit` count, the last element in the list will contain the remainder of the string (inclusive of any `sep` chars/strings). – BuvinJ Sep 10 '19 at 13:01
`.partition` isn't as well known (nor its counterpart `.rpartition`), but it's faster and simpler for this case. It has only been available since 2.5, but 2.5 was quite mature when this was originally written (and 2.7 and 3.2 were available). – Karl Knechtel Jan 22 '23 at 13:01

score 94 · Answer 2 · edited Aug 01 '11 at 19:56

>>> s = "123mango abcd mango kiwi peach"
>>> s.split("mango", 1)
['123', ' abcd mango kiwi peach']
>>> s.split("mango", 1)[1]
' abcd mango kiwi peach'

edited Aug 01 '11 at 19:56

answered Aug 01 '11 at 19:47

utdemir


@Swiss: So what. The technique is still the same. – Ignacio Vazquez-Abrams Aug 01 '11 at 19:55
@Ignacio: I'm just pointing it out. No reason to have a partially correct answer in place of a completely correct one. – Swiss Aug 01 '11 at 19:57
Technically assumes the correct delimiter. The 'first' is the [1] index. The one we are all referencing would of course be the zeroth index. :D Semantics. – Nov 15 '17 at 13:19
I got ``"value" parameter must be a scalar or dict, but you passed a "list"`` returned with ``s.split("mango", 1)[1]`` – yuliansen Sep 28 '20 at 06:14
In my case I had to use s.split("mango", 1, expand=True)[1] on Pandas, because I was getting an error – Alvaro Parra Dec 15 '21 at 13:55

@AlvaroParra well, yes; if you're using the `.split` method of a Pandas `Series`, it will work differently than the `.split` method of a built-in string. – Karl Knechtel Jan 22 '23 at 12:53

score 36 · Answer 3 · answered Jun 09 '14 at 08:26


For me the better approach is:

s.split('mango', 1)[-1]

...because with [1], if the delimiter does not occur in the string, you'll get "IndexError: list index out of range".

Using -1 does no harm, because the number of splits is already limited to one, so the last element is always the part after the first occurrence (or the whole string if there was none).
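A short sketch of the difference, assuming a delimiter `'grape'` that is absent from the string. The helper name `after_first_strict` is hypothetical, included for callers who would rather fail loudly than silently get the whole string back:

```python
s = "123mango abcd mango kiwi peach"

# Delimiter present: [1] and [-1] pick the same element.
assert s.split("mango", 1)[1] == s.split("mango", 1)[-1]

# Delimiter absent: the list has one element, so [-1] returns the whole
# string while [1] raises IndexError.
whole = s.split("grape", 1)[-1]

try:
    s.split("grape", 1)[1]
    raised = False
except IndexError:
    raised = True


def after_first_strict(text, sep):
    """Hypothetical helper: return the text after the first sep, or raise."""
    parts = text.split(sep, 1)
    if len(parts) != 2:
        raise ValueError(f"{sep!r} not found in {text!r}")
    return parts[1]
```

Which behaviour is preferable depends on whether a missing delimiter is an error in the calling code.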

answered Jun 09 '14 at 08:26

Alex


As written before, the 1 is the maximum number of splits that split() performs. The method will find and split on only the first 'mango'. – Alex Jul 01 '17 at 06:57

Attention, this really depends on what you are going to use the result for. In many cases you will need to know whether the string was split, or you will need the program to fail if the split did not happen. With your implementation the problem would be silently skipped and it would be much more complicated to find its cause. – pabouk - Ukraine stay strong May 17 '22 at 15:19

score 21 · Answer 4 · answered Oct 30 '19 at 16:40


You can also use str.partition:

>>> text = "123mango abcd mango kiwi peach"

>>> text.partition("mango")
('123', 'mango', ' abcd mango kiwi peach')

>>> text.partition("mango")[-1]
' abcd mango kiwi peach'

>>> text.partition("mango")[-1].lstrip()  # if leading whitespace should be stripped
'abcd mango kiwi peach'

The advantage of using str.partition is that it always returns a tuple of the form:

(<pre>, <separator>, <post>)

So this makes unpacking the output really flexible as there's always going to be 3 elements in the resulting tuple.
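For instance, here is a rough sketch of parsing `key=value` lines where some lines have no separator at all. The sample lines and the name `parse_line` are made up for illustration:

```python
def parse_line(line):
    """Split on the first '=' only; value is None when there is no '='."""
    key, sep, value = line.partition("=")
    # sep is '' when the separator was not found.
    return key.strip(), (value.strip() if sep else None)


lines = ["name = mango", "path=a=b", "flag"]
parsed = [parse_line(line) for line in lines]
print(parsed)
# [('name', 'mango'), ('path', 'a=b'), ('flag', None)]
```

Because the tuple always has three elements, the unpacking never fails, whatever the line contains.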

answered Oct 30 '19 at 16:40

heemayl


This is really useful for creating key value pairs from a line of text, if some of the lines only have a key, since, as you pointed out, you always get a tuple: `key, _, value = text_line.partition(' ')` – Enterprise Dec 05 '20 at 16:51
You could even ignore the separator in the tuple with a one-liner using slices: `key, value = text_line.partition(' ')[::2]` – giuliano-oliveira Sep 16 '21 at 18:02
This is the way. I wrote a more detailed answer, showing every plausible way to do it, and timing them. This is simpler and more performant than the `.split` approach, and others are even slower even though they might "look optimized". – Karl Knechtel Jan 22 '23 at 12:52

score 3 · Answer 5 · answered Jan 22 '23 at 12:47

Summary

The simplest and best-performing approach is to use the .partition method of the string.

Commonly, people may want to get the part either before or after the delimiter that was found, and may want to find either the first or last occurrence of the delimiter in the string. For most techniques, all of these possibilities are roughly as simple, and it is straightforward to convert from one to another.

For the below examples, we will assume:

>>> import re
>>> s = '123mango abcd mango kiwi peach'

Using `.split`

>>> s.split('mango', 1)
['123', ' abcd mango kiwi peach']

The second parameter to .split limits the number of times the string will be split. This gives the parts both before and after the delimiter; then we can select what we want.

If the delimiter does not appear, no splitting is done:

>>> s.split('grape', 1)
['123mango abcd mango kiwi peach']

Thus, to check whether the delimiter was present, check the length of the result before working with it.
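One way to make that length check explicit is sketched below. The function name `split_first` and its `(found, before, after)` return shape are assumptions chosen for this example:

```python
def split_first(text, sep):
    """Return (found, before, after) for the first occurrence of sep."""
    parts = text.split(sep, 1)
    if len(parts) == 2:
        return True, parts[0], parts[1]
    # Only one element: the separator never occurred.
    return False, text, ""


s = "123mango abcd mango kiwi peach"
print(split_first(s, "mango"))  # (True, '123', ' abcd mango kiwi peach')
print(split_first(s, "grape"))  # (False, '123mango abcd mango kiwi peach', '')
```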

Using `.partition`

>>> s.partition('mango')
('123', 'mango', ' abcd mango kiwi peach')

The result is a tuple instead, and the delimiter itself is preserved when found.

When the delimiter is not found, the result will be a tuple of the same length, with two empty strings in the result:

>>> s.partition('grape')
('123mango abcd mango kiwi peach', '', '')

Thus, to check whether the delimiter was present, check the value of the second element.
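A small sketch of that check. It also shows why the second element, rather than the third, is the one to test: the third element is legitimately empty when the delimiter sits at the very end of the string (the string `'abcmango'` is an invented example).

```python
s = "123mango abcd mango kiwi peach"

# Not found: the middle element is the empty string.
before, sep, after = s.partition("grape")
found_grape = bool(sep)

# Found at the very end: 'after' is empty, but 'sep' is not.
before2, sep2, after2 = "abcmango".partition("mango")
found_at_end = bool(sep2)

print(found_grape)           # False
print(found_at_end, after2)  # True, followed by an empty string
```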

Using regular expressions

>>> # Using the top-level module functionality
>>> re.split(re.escape('mango'), s, 1)
['123', ' abcd mango kiwi peach']
>>> # Using an explicitly compiled pattern
>>> mango = re.compile(re.escape('mango'))
>>> mango.split(s, 1)
['123', ' abcd mango kiwi peach']

The .split method of regular expressions has the same argument as the built-in string .split method, to limit the number of splits. Again, no splitting is done when the delimiter does not appear:

>>> grape = re.compile(re.escape('grape'))
>>> grape.split(s, 1)
['123mango abcd mango kiwi peach']

In these examples, re.escape has no effect, but in the general case it's necessary in order to specify a delimiter as literal text. On the other hand, using the re module opens up the full power of regular expressions:

>>> vowels = re.compile('[aeiou]')
>>> # Split on any vowel, without a limit on the number of splits:
>>> vowels.split(s)
['123m', 'ng', ' ', 'bcd m', 'ng', ' k', 'w', ' p', '', 'ch']

(Note the empty string: that was found between the e and the a of peach.)
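To illustrate why `re.escape` matters in the general case, here is a small sketch with a delimiter that is itself a regex metacharacter. The sample string is invented for this example, and `maxsplit` is passed by keyword, which is also accepted by `re.split`:

```python
import re

text = "a.b.c"

# Unescaped, '.' matches any character, so the first match is the 'a' itself.
unescaped = re.split(".", text, maxsplit=1)
# Escaped, '.' is treated as a literal dot.
escaped = re.split(re.escape("."), text, maxsplit=1)

print(unescaped)  # ['', '.b.c']
print(escaped)    # ['a', 'b.c']
```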

Using indexing and slicing

Use the .index method of the string to find out where the delimiter is, then slice with that:

>>> s[:s.index('mango')] # for everything before the delimiter
'123'
>>> s[s.index('mango')+len('mango'):] # for everything after the delimiter
' abcd mango kiwi peach'

This gives the wanted part directly, without building an intermediate list or tuple. However, if the delimiter is not found, an exception will be raised instead:

>>> s[:s.index('grape')]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: substring not found
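If an exception is not wanted, one option is `str.find`, which returns -1 instead of raising. A sketch, with the hypothetical helper name `after_first` and a caller-supplied `default`:

```python
def after_first(text, sep, default=None):
    """Return the text after the first sep, or default if sep is absent."""
    pos = text.find(sep)  # -1 when sep does not occur
    if pos == -1:
        return default
    return text[pos + len(sep):]


s = "123mango abcd mango kiwi peach"
print(repr(after_first(s, "mango")))  # ' abcd mango kiwi peach'
print(repr(after_first(s, "grape")))  # None
```

The explicit `-1` check matters: using the result of `find` directly in a slice would silently produce a wrong substring when the delimiter is missing.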

Everything after the last occurrence, instead

Though it wasn't asked, I include related techniques here for reference.

The .split and .partition techniques have direct counterparts, to get the last part of the string (i.e., everything after the last occurrence of the delimiter). For reference:

>>> '123mango abcd mango kiwi peach'.rsplit('mango', 1)
['123mango abcd ', ' kiwi peach']
>>> '123mango abcd mango kiwi peach'.rpartition('mango')
('123mango abcd ', 'mango', ' kiwi peach')
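One detail worth checking when the delimiter is absent: `.rpartition` puts the empty strings first and the whole string last, the mirror image of `.partition`. A quick sketch using the same example string:

```python
s = "123mango abcd mango kiwi peach"

# Not found: partition keeps the string in slot 0, rpartition in slot 2.
print(s.partition("grape"))   # ('123mango abcd mango kiwi peach', '', '')
print(s.rpartition("grape"))  # ('', '', '123mango abcd mango kiwi peach')

# So [-1] on rpartition returns the whole string when nothing matched,
# just as rsplit(sep, 1)[-1] does.
last_rpartition = s.rpartition("grape")[-1]
last_rsplit = s.rsplit("grape", 1)[-1]
```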

Similarly, there is a .rindex to match .index, but it still gives the index of the beginning of the last match of the delimiter, so the delimiter's length must still be added to get the part after it. Thus:

>>> s[:s.rindex('mango')] # everything before the last match
'123mango abcd '
>>> s[s.rindex('mango')+len('mango'):] # everything after the last match
' kiwi peach'

For the regular expression approach, we can fall back on the technique of reversing the input, looking for the first appearance of the reversed delimiter, reversing the individual results, and reversing the result list:

>>> ognam = re.compile(re.escape('mango'[::-1]))
>>> [x[::-1] for x in ognam.split('123mango abcd mango kiwi peach'[::-1], 1)][::-1]
['123mango abcd ', ' kiwi peach']

Of course, this is almost certainly more effort than it's worth.

Another way is to use negative lookahead from the delimiter to the end of the string:

>>> literal_mango = re.escape('mango')
>>> last_mango = re.compile(f'{literal_mango}(?!.*{literal_mango})')
>>> last_mango.split('123mango abcd mango kiwi peach', 1)
['123mango abcd ', ' kiwi peach']

Because of the lookahead, this is a worst-case O(n^2) algorithm.

Performance testing

$ python -m timeit --setup="s='123mango abcd mango kiwi peach'" "s.partition('mango')[-1]"
2000000 loops, best of 5: 128 nsec per loop
$ python -m timeit --setup="s='123mango abcd mango kiwi peach'" "s.split('mango', 1)[-1]"
2000000 loops, best of 5: 157 nsec per loop
$ python -m timeit --setup="s='123mango abcd mango kiwi peach'" "s[s.index('mango')+len('mango'):]"
1000000 loops, best of 5: 250 nsec per loop
$ python -m timeit --setup="s='123mango abcd mango kiwi peach'; import re; mango=re.compile(re.escape('mango'))" "mango.split(s, 1)[-1]"
1000000 loops, best of 5: 258 nsec per loop

Though more flexible, the regular expression approach is definitely slower. Limiting the number of splits improves performance with both the string method and regular expressions (timings without the limit are not shown, because they are slower and also give a different result), but .partition is still a clear winner.

For this test data, the .index approach was slower even though it only has to create one substring and doesn't have to iterate over text beyond the match (for the purpose of creating the other substrings). Pre-computing the length of the delimiter helps, but this is still slower than the .split and .partition approaches.
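The same comparison can be run from a script with the standard `timeit` module. This is only a sketch: the labels in `candidates` and the loop counts are arbitrary choices, the lambdas add some call overhead of their own, and the absolute numbers will differ by machine and Python version.

```python
import timeit

s = "123mango abcd mango kiwi peach"
n = len("mango")  # pre-computed delimiter length for the slicing variant

candidates = {
    "partition": lambda: s.partition("mango")[-1],
    "split": lambda: s.split("mango", 1)[-1],
    "index+len": lambda: s[s.index("mango") + len("mango"):],
    "index+precomputed": lambda: s[s.index("mango") + n:],
}

# Sanity check first: every candidate should produce the same substring.
results = {name: fn() for name, fn in candidates.items()}

for name, fn in candidates.items():
    # Best of 5 repeats, 100_000 calls each; reported per call in nanoseconds.
    best = min(timeit.repeat(fn, number=100_000, repeat=5))
    print(f"{name:>18}: {best / 100_000 * 1e9:.0f} nsec per call")
```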
