761

I found some answers online, but I have no experience with regular expressions, which I believe is what is needed here.

I have a string that needs to be split by either a ';' or ', '. That is, the delimiter has to be either a semicolon or a comma followed by a space. Individual commas without trailing spaces should be left untouched.

Example string:

"b-staged divinylsiloxane-bis-benzocyclobutene [124221-30-3], mesitylene [000108-67-8]; polymerized 1,2-dihydro-2,2,4- trimethyl quinoline [026780-96-1]"

should be split into a list containing the following:

('b-staged divinylsiloxane-bis-benzocyclobutene [124221-30-3]', 'mesitylene [000108-67-8]', 'polymerized 1,2-dihydro-2,2,4- trimethyl quinoline [026780-96-1]')
jww
gt565k

5 Answers

1239

Luckily, Python has this built-in :)

import re
re.split('; |, ', string_to_split)
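As a quick check, here is that pattern applied to the example string from the question. This is only a sketch; the variable name `parts` is made up for illustration.

```python
import re

# The example string from the question.
string_to_split = (
    "b-staged divinylsiloxane-bis-benzocyclobutene [124221-30-3], "
    "mesitylene [000108-67-8]; "
    "polymerized 1,2-dihydro-2,2,4- trimethyl quinoline [026780-96-1]"
)

# Split on "; " or ", ". Commas without a trailing space are left alone.
parts = re.split(r'; |, ', string_to_split)
for part in parts:
    print(part)
```

The commas inside "1,2-dihydro-2,2,4-" are not followed by a space, so they stay part of the third item.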

Update:
Following your comment:

>>> a='Beautiful, is; better*than\nugly'
>>> import re
>>> re.split(r'; |, |\*|\n', a)
['Beautiful', 'is', 'better', 'than', 'ugly']
Błażej Michalik
Jonathan Livni
  • @Paul There isn't. You aren't understanding regex properly if you think there is. See my comment on your post below. – alldayremix Feb 21 '13 at 23:23
  • 28
    I'd prefer to write it as: re.split(r';|,\s', a) by replacing ' ' (space character) with '\s' (white space) unless space character is a strict requirement. – Humble Learner Sep 12 '13 at 20:51
  • 114
    I wonder why (regular) split just can't accept a list, that seems like a more obvious way instead of encoding multiple options in a line. – himself Jun 12 '14 at 16:02
  • It's a regex feature, not a Python one. :) And Python's regex is a little lame (but usually good enough). But that's why we have the regex module. – ThorSummoner May 11 '15 at 18:35
  • Is it possible to return the delimiters in the array too? Example: ['Beautiful', ',', 'is', ';', 'better', '*', 'than', '\n', 'ugly'] – amir jj Oct 23 '16 at 05:47
  • 12
    It is worth noting that this uses some regex-like things, as mentioned above. So trying to split a string with . will split on every single character. You need to escape it: \. – marsh Nov 14 '16 at 15:38
  • 69
    Just to add to this a little bit, instead of adding a bunch of or "|" symbols you can do the following: re.split('[;,.\-\%]',str), where inside of [ ] you put all the characters you want to split by. – jmracek Nov 06 '17 at 21:31
  • Is there a way to know which delimiter is actually used for a specific split? In the above example, than and ugly are split by '\n' and better and than are split by '*'. – srinivasu u Feb 14 '18 at 04:59
  • @jmracek: Thanks for this comment, I had to use it to make my split work – pnv Apr 26 '19 at 10:26
  • 4
    Is there a way to retain the delimiters in the output but combine them together? I know that doing `re.split('(; |, |\*|\n)', a)` will retain the delimiters, but how can I combine subsequent delimiters into one element in the output list? – Konstantin Jul 20 '20 at 14:35
  • 1
    @jmracek That's worthy of a standalone answer – WestCoastProjects Dec 27 '20 at 19:28
  • 1
    Luckily you say ;) – micsthepick Feb 13 '21 at 09:07
  • For people not very familiar with regex, note that there are some common separators that have to be escaped - for example `.` and `?` need to be escaped as `\.` and `\?`. More info can be found here: https://riptutorial.com/regex/example/15848/what-characters-need-to-be-escaped- – Gray-lab Apr 28 '22 at 21:30
  • @jonathan-livni how do I do this to user input strings such as `list(map(int, input().split()))` ? – 0_0perplexed Jul 17 '22 at 23:24
  • I upvoted this answer but don't like the need to list all possibilities of spaces like ",\s", ",\s\s", "\s,\s" and so on. That's why I prefer to split on each character, then throw away empty slices. `[s for s in re.split(r'[;,\*\n\s]', a) if s]` – Anselmo Blanco Dominguez Mar 24 '23 at 14:20
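Several comments above ask how to keep the delimiters, how to tell which delimiter produced a given split, and how to merge runs of delimiters. A minimal sketch follows, assuming a capturing group is acceptable; the names `with_delims`, `used`, `merged`, and `words` are made up for illustration.

```python
import re

a = 'Beautiful, is; better*than\nugly'

# A capturing group makes re.split return the delimiters as well.
with_delims = re.split(r'(; |, |\*|\n)', a)
print(with_delims)
# ['Beautiful', ', ', 'is', '; ', 'better', '*', 'than', '\n', 'ugly']

# The delimiters sit at the odd indices, in the order they were used.
used = with_delims[1::2]
print(used)
# [', ', '; ', '*', '\n']

# Wrapping the alternation in (?:...)+ merges a run of adjacent
# delimiters into a single element.
merged = re.split(r'((?:; |, |\*|\n)+)', 'one, ; two*\nthree')
print(merged)
# ['one', ', ; ', 'two', '*\n', 'three']

# Character-class variant: split on every single delimiter character,
# then drop the empty strings left between adjacent delimiters.
words = [s for s in re.split(r'[;,*\n\s]', a) if s]
print(words)
# ['Beautiful', 'is', 'better', 'than', 'ugly']
```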
484

Do a str.replace('; ', ', ') and then a str.split(', ')
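Spelled out on the question's example string, that could look like the sketch below. The variable names are made up for illustration.

```python
# The example string from the question.
string_to_split = (
    "b-staged divinylsiloxane-bis-benzocyclobutene [124221-30-3], "
    "mesitylene [000108-67-8]; "
    "polymerized 1,2-dihydro-2,2,4- trimethyl quinoline [026780-96-1]"
)

# Normalise every "; " to ", " first, then split on the single remaining delimiter.
parts = string_to_split.replace('; ', ', ').split(', ')
for part in parts:
    print(part)
```

This gives the three items the question asks for, with no regular expression involved.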

Joe
  • 97
    Suppose you have 5 delimiters: you have to traverse your string 5 times. – om-nom-nom Sep 26 '12 at 23:23
  • 10
    that is very bad for performance – Phyo Arkar Lwin Nov 26 '12 at 18:04
  • 33
    This shows a different vision of yours toward this problem. I think it is a great one. "If you don't know a direct answer, use combination of things you know to solve it". – AliBZ Jul 23 '13 at 18:04
  • 38
    If you have a small number of delimiters and are performance-constrained, the `replace` trick is the fastest of all. 15x faster than regexp, and almost 2x faster than a nested `for in val.split(...)` generator. – monoid May 23 '16 at 07:36
  • what if an array has empty slots? ['6', '1862', '5', '1863', '222', '', '', '', '', ''] – June Wang Sep 24 '19 at 08:11
  • @JuneWang one method might be to loop through the elements of the array and upon finding an empty element or an element which you desire to remove, remove that from the array by using array.remove(element) – Gjison Giocf Mar 23 '20 at 19:29
  • 2
    Of course you will get better performance by using re.split() if you have multiple separator characters, but this is a very smart and easy-to-understand way of solving the problem. – pedram bashiri Sep 08 '20 at 20:05
  • 4
    Performance is not always a concern. My use case was to process input from a human-entered command line argument so this solution was quite ideal. I also try to avoid regex whenever possible. Easy to create, very difficult to read. – Craig Jackson Oct 09 '20 at 19:51
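The performance claims in these comments depend on the input and the Python version, so it may be worth measuring on your own data. A rough sketch with `timeit` follows; the test string, function names, and repeat count are arbitrary assumptions, and no particular timings are implied.

```python
import re
import timeit

# Arbitrary sample input: many short fields separated by ", " and "; ".
text = "; ".join(["alpha, beta"] * 1000)

pattern = re.compile(r'; |, ')

def split_with_replace(s):
    # One pass to normalise the delimiter, one pass to split.
    return s.replace('; ', ', ').split(', ')

def split_with_regex(s):
    # Single pass with a precompiled alternation.
    return pattern.split(s)

# Both approaches should agree before their speed is compared.
assert split_with_replace(text) == split_with_regex(text)

replace_time = timeit.timeit(lambda: split_with_replace(text), number=200)
regex_time = timeit.timeit(lambda: split_with_regex(text), number=200)
print("replace + split: %.4f s" % replace_time)
print("re.split:        %.4f s" % regex_time)
```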
180

Here's a safe way for any iterable of delimiters, using regular expressions:

>>> import re
>>> delimiters = "a", "...", "(c)"
>>> example = "stackoverflow (c) is awesome... isn't it?"
>>> regex_pattern = '|'.join(map(re.escape, delimiters))
>>> regex_pattern
'a|\\.\\.\\.|\\(c\\)'
>>> re.split(regex_pattern, example)
['st', 'ckoverflow ', ' is ', 'wesome', " isn't it?"]

re.escape lets you build the pattern automatically, with every delimiter escaped so that it is matched literally.

Here's this solution as a function for your copy-pasting pleasure:

import re

def split(delimiters, string, maxsplit=0):
    # Escape each delimiter so regex metacharacters are matched literally.
    regex_pattern = '|'.join(map(re.escape, delimiters))
    return re.split(regex_pattern, string, maxsplit=maxsplit)

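If you're going to split often using the same delimiters, compile your regular expression beforehand with re.compile and call split on the compiled pattern object (RegexObject.split in the Python 2 documentation, Pattern.split in the Python 3 documentation).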
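A minimal sketch of the precompiled variant, assuming the two delimiters from the question; the names `splitter`, `lines`, and `rows` are made up for illustration.

```python
import re

delimiters = ("; ", ", ")

# Build and compile the pattern once, then reuse it for every string.
splitter = re.compile('|'.join(map(re.escape, delimiters)))

lines = ["a, b; c", "d; e, f"]
rows = [splitter.split(line) for line in lines]
print(rows)
# [['a', 'b', 'c'], ['d', 'e', 'f']]
```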


If you'd like to leave the original delimiters in the string, you can change the regex to use a lookbehind assertion instead (this relies on re.split splitting on empty matches, which requires Python 3.7 or later):

>>> import re
>>> delimiters = "a", "...", "(c)"
>>> example = "stackoverflow (c) is awesome... isn't it?"
>>> regex_pattern = '|'.join('(?<={})'.format(re.escape(delim)) for delim in delimiters)
>>> regex_pattern
'(?<=a)|(?<=\\.\\.\\.)|(?<=\\(c\\))'
>>> re.split(regex_pattern, example)
['sta', 'ckoverflow (c)', ' is a', 'wesome...', " isn't it?"]

(Replace ?<= with ?= to attach the delimiters to the right-hand piece instead of the left-hand one.)
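For comparison, here is a sketch of the lookahead variant on the same example string. Like the lookbehind version, it assumes Python 3.7 or later; the names `lookahead_pattern` and `pieces` are made up for illustration.

```python
import re

delimiters = "a", "...", "(c)"
example = "stackoverflow (c) is awesome... isn't it?"

# Lookahead: split just before each delimiter, so the delimiter
# starts the following piece instead of ending the previous one.
lookahead_pattern = '|'.join('(?={})'.format(re.escape(d)) for d in delimiters)
pieces = re.split(lookahead_pattern, example)
print(pieces)
# ['st', 'ackoverflow ', '(c) is ', 'awesome', "... isn't it?"]
```

Because the matches are zero-width, joining the pieces gives back the original string unchanged.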

UCYT5040
Kos
93

In response to Jonathan's answer above, this only seems to work for certain delimiters. For example:

>>> a='Beautiful, is; better*than\nugly'
>>> import re
>>> re.split(r'; |, |\*|\n', a)
['Beautiful', 'is', 'better', 'than', 'ugly']

>>> b='1999-05-03 10:37:00'
>>> re.split('- :', b)
['1999-05-03 10:37:00']

Putting the delimiters in square brackets (a character class, which matches any one of the listed characters) makes it work:

>>> re.split('[- :]', b)
['1999', '05', '03', '10', '37', '00']
Paul
  • 19
    It works for all the delimiters you specify. A regex of `- :` matches exactly `- :` and thus won't split the date/time string. A regex of `[- :]` matches `-`, ` `, or `:` and thus splits the date/time string. If you want to split only on `-` and `:` then your regex should be either `[-:]` or `-|:`, and if you want to split on `-`, ` ` and `:` then your regex should be either `[- :]` or `-| |:`. – alldayremix Feb 21 '13 at 23:11
  • 6
    @alldayremix I see my mistake: I missed the fact that your regex contains the OR |. I blindly identified it as a desired separator. – Paul Apr 04 '13 at 11:15
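The distinction drawn in these comments can be checked directly. A small sketch on the same date/time string follows; the result variable names are made up for illustration.

```python
import re

b = '1999-05-03 10:37:00'

# The literal sequence "- :" never occurs in b, so nothing is split.
literal = re.split(r'- :', b)

# A character class matches any ONE of "-", " " or ":".
char_class = re.split(r'[- :]', b)

# The same thing written as an alternation.
alternation = re.split(r'-| |:', b)

# Leaving the space out of the class keeps "03 10" together.
no_space = re.split(r'[-:]', b)

print(literal)      # ['1999-05-03 10:37:00']
print(char_class)   # ['1999', '05', '03', '10', '37', '00']
print(alternation)  # ['1999', '05', '03', '10', '37', '00']
print(no_space)     # ['1999', '05', '03 10', '37', '00']
```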
39

This is what the regex looks like, depending on which reading of the requirement you want:

import re
# "semicolon or (a comma followed by a space)"
pattern = re.compile(r";|, ")

# "(semicolon or a comma) followed by a space"
pattern = re.compile(r"[;,] ")

print(pattern.split(text))
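The two patterns behave differently when a semicolon or a comma is not followed by a space. A sketch on a made-up test string follows; `text` and the other names here are assumptions for illustration only.

```python
import re

text = "a;b, c; d,e"

# "semicolon or (a comma followed by a space)":
# a bare ";" splits, and the space after "; " stays on the next item.
semicolon_or_comma_space = re.compile(r";|, ").split(text)

# "(semicolon or a comma) followed by a space":
# a bare ";" or "," is left untouched.
either_then_space = re.compile(r"[;,] ").split(text)

print(semicolon_or_comma_space)  # ['a', 'b', 'c', ' d,e']
print(either_then_space)         # ['a;b', 'c', 'd,e']
```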
Jochen Ritzel