
I'm trying out this code in [PyPy 5.1.2 with GCC 5.3.1 20160413]:

hiragana = "あえいおう"
regular = "aeiou"
mixed = "あえいおうaeiou"

print hiragana.split("い")
# ['\xe3\x81\x82\xe3\x81\x88', '\xe3\x81\x8a\xe3\x81\x86']
print regular.split("i")
# ['ae', 'ou']

I want to split the mixed string on both delimiters ("い" and "i") to get this:

# [ "\xe3\x81\x82\xe3\x81\x88", "\xe3\x81\x8a\xe3\x81\x86ae", "ou"]

The re module produces an unexpected result when I put both delimiters in a character class:

print re.split("[いi]", mixed)
# ['', '', '\x82', '', '\x88', '', '', '', '', '\x8a', '', '\x86ae', 'ou']

Question:

Does Python have a function that splits on multiple delimiters?

ᴀʀᴍᴀɴ
Jon Abaca
  • Don't try to split on UTF-8 bytes; you'd be better off decoding to a unicode string object *first*: `re.split(ur'[いi]', mixed.decode('utf8'))`. Otherwise, putting `い` into a `[..]` character class tells the regex to split on any of the 3 bytes that encode that codepoint. – Martijn Pieters Mar 06 '17 at 09:37
  • The alternative is to split on individual byte sequences: `re.split(r'(?:い|i)', mixed)` (so split either on the three UTF-8 bytes of `い`, *or* on the single byte for `i`). – Martijn Pieters Mar 06 '17 at 09:39
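The decode-first approach from the comments can be sketched as follows. This is only an illustration, not the commenter's exact code: the names `mixed_bytes`, `text`, `parts`, and `parts_bytes` are made up here, and the `u"..."` literals are used so the snippet should behave the same on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
import re

# Stand-in for the UTF-8 byte string from the question (assumed encoding: UTF-8).
mixed_bytes = u"あえいおうaeiou".encode("utf-8")

# Decode once so the regex sees code points rather than individual bytes.
text = mixed_bytes.decode("utf-8")

# In a unicode pattern, the character class matches the whole character "い" or "i".
parts = re.split(u"[いi]", text)

# Re-encode only if byte strings are needed downstream.
parts_bytes = [p.encode("utf-8") for p in parts]
```

After decoding, `parts` holds three unicode strings, and `parts_bytes` matches the byte strings the question asks for.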

1 Answer


Splitting on an alternation instead of a character class works for me with both Python and PyPy.

import re

mixed = "あえいおうaeiou"

print re.split(r'い|i', mixed)
# ['\xe3\x81\x82\xe3\x81\x88', '\xe3\x81\x8a\xe3\x81\x86ae', 'ou']
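To see why alternation behaves differently from a character class on byte strings, here is a rough side-by-side sketch. The names `delim`, `char_class`, and `alternation` are illustrative only, and the patterns are built from explicit byte strings so the snippet should run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
import re

mixed = u"あえいおうaeiou".encode("utf-8")
delim = u"い".encode("utf-8")  # three bytes: \xe3 \x81 \x84

# Inside [...], each byte of the delimiter is a separate alternative,
# so the string is cut at every \xe3, \x81, \x84, or "i".
char_class = re.split(b"[" + delim + b"i]", mixed)

# With alternation, the three bytes must appear together, in order,
# so only the full encoding of "い" (or the single byte "i") is a delimiter.
alternation = re.split(re.escape(delim) + b"|i", mixed)
```

Since every hiragana character here starts with the bytes `\xe3\x81`, the character-class version fragments the string, while the alternation version yields the three expected pieces.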
Organis