
In Python, how can I split a string using multiple delimiters and know which delimiter was used to separate any two elements?

E.g. in the following example taken from this post:

>>> a='Beautiful, is; better*than\nugly'
>>> import re
>>> re.split(r'; |, |\*|\n', a)
['Beautiful', 'is', 'better', 'than', 'ugly']

how can I determine that the separator which separated 'is' and 'better' was '; '?

GNU awk (gawk) has a useful way to accomplish this with patsplit(string, array [, fieldpat [, seps ] ]), where seps is an array that holds the separators found between elements. In this case, seps[1] would be ', ', seps[2] would be '; ', seps[3] would be '*', and seps[4] would be '\n'. I didn't see a similar feature in re.split.

petezurich
Rusty Lemur

1 Answer


Wrap the pattern in a capturing group so that re.split includes the matched separators in its output:

In [16]: a = 'Beautiful, is; better*than\nugly'

In [17]: re.split(r'(; |, |\*|\n)', a)
Out[17]: ['Beautiful', ', ', 'is', '; ', 'better', '*', 'than', '\n', 'ugly']

The separators then sit at the odd indices of the result, so ordinary indexing retrieves them.
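As a minimal sketch of that indexing, applied to the question's example (the variable names `parts`, `i`, and `sep_after_is` are only illustrative): to find the separator between 'is' and 'better', look at the element right after 'is' in the combined list.

```python
import re

a = 'Beautiful, is; better*than\nugly'
parts = re.split(r'(; |, |\*|\n)', a)

# Words sit at even indices, separators at odd indices.
i = parts.index('is')        # position of the first element equal to 'is'
sep_after_is = parts[i + 1]  # the separator that immediately follows it
print(repr(sep_after_is))    # '; '
```

Note that `list.index` returns the first match only, so this lookup assumes the word occurs once and is not identical to any separator.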

If you want only the split words, slice from index 0 with a step of 2:

In [18]: re.split(r'(; |, |\*|\n)', a)[::2]
Out[18]: ['Beautiful', 'is', 'better', 'than', 'ugly']

To get only the separators, slice from index 1 with a step of 2:

In [19]: re.split(r'(; |, |\*|\n)', a)[1::2]
Out[19]: [', ', '; ', '*', '\n']
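The two slices can be combined into something loosely resembling the seps array from gawk's patsplit. The sketch below is only one possible arrangement: `split_with_seps` is a hypothetical helper name, not part of the `re` module, and it assumes the pattern passed in contains no capturing groups of its own (extra groups would add more elements to the result of `re.split`).

```python
import re
from itertools import zip_longest

def split_with_seps(pattern, text):
    """Hypothetical helper: return (words, seps) from a single re.split call."""
    # Wrap the whole pattern in one capturing group so re.split keeps the separators.
    # Assumption: `pattern` has no capturing groups of its own.
    parts = re.split(f'({pattern})', text)
    return parts[::2], parts[1::2]

words, seps = split_with_seps(r'; |, |\*|\n', 'Beautiful, is; better*than\nugly')

# Pair each word with the separator that follows it; the last word has none.
pairs = list(zip_longest(words, seps))
for word, sep in pairs:
    print(repr(word), repr(sep))
```

Here `seps[k]` is the separator between `words[k]` and `words[k + 1]`, so the indices are shifted by one compared with the awk description in the question.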
heemayl