PyPy Unicode Split String

Question

I'm trying out this code in [PyPy 5.1.2 with GCC 5.3.1 20160413]

hiragana = "あえいおう"
regular = "aeiou"
mixed = "あえいおうaeiou"

print hiragana.split("い")
# ['\xe3\x81\x82\xe3\x81\x88', '\xe3\x81\x8a\xe3\x81\x86']
print regular.split("i")
# ['ae', 'ou']

I want to split the mixed string to get this:

# [ "\xe3\x81\x82\xe3\x81\x88", "\xe3\x81\x8a\xe3\x81\x86ae", "ou"]

The re module produces an unexpected result.

print re.split("[いi]", mixed)
# ['', '', '\x82', '', '\x88', '', '', '', '', '\x8a', '', '\x86ae', 'ou']

Question:

Does Python have a function that splits on multiple delimiters?

Don't try to split on UTF-8 bytes; you'd be better off decoding to a unicode string object *first*: `re.split(ur'[いi]', mixed.decode('utf8'))`. Otherwise, putting `い` into a `[..]` character class tells the regex to split on any of the 3 bytes that encode that codepoint. - Martijn Pieters, Mar 06 '17 at 09:37
The alternative is to split on individual byte sequences: `re.split(r'(?:い|i)', mixed)` (so split either on the three UTF-8 bytes of `い`, *or* on the single byte for `i`). - Martijn Pieters, Mar 06 '17 at 09:39
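
A minimal sketch of the decode-first approach from the comments above, written so that it should run on both Python 2 and Python 3. The `unicode_literals` import and the explicit `.encode("utf-8")` are my own additions, there only to reproduce the byte string a Python 2 `str` literal would hold; the variable names are illustrative.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import re

# Assumption: the input arrives as UTF-8 bytes, as a Python 2 str literal would.
mixed_bytes = "あえいおうaeiou".encode("utf-8")

# Decode first, so the character class matches whole code points, not single bytes.
mixed_text = mixed_bytes.decode("utf-8")
parts = re.split("[いi]", mixed_text)

# Re-encode only if byte strings are needed downstream.
parts_bytes = [p.encode("utf-8") for p in parts]
```

Here `parts` holds unicode strings, and `parts_bytes` holds the same three pieces as UTF-8 byte strings, matching the output the question asks for.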

score 0 · Accepted Answer · answered Mar 06 '17 at 09:42

This works for me with both Python and PyPy.

import re

mixed = "あえいおうaeiou"

print re.split(r'い|i', mixed)
# ['\xe3\x81\x82\xe3\x81\x88', '\xe3\x81\x8a\xe3\x81\x86ae', 'ou']

answered Mar 06 '17 at 09:42

Organis
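
As for a general "split on multiple delimiters" function: as far as I know there is no dedicated built-in, but the alternation used in the accepted answer can be built from a list. The sketch below is one way to do that on byte strings; `split_on_any` is a hypothetical helper name, not something from the standard library, and the longest-first ordering is my own precaution for delimiters that share a prefix.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import re

def split_on_any(data, delimiters):
    """Split the byte string `data` on any of the byte-string `delimiters`."""
    # Try longer delimiters first so a shorter one cannot shadow a longer one.
    ordered = sorted(delimiters, key=len, reverse=True)
    # re.escape keeps regex metacharacters in a delimiter from being interpreted.
    pattern = b"|".join(re.escape(d) for d in ordered)
    return re.split(pattern, data)

mixed = "あえいおうaeiou".encode("utf-8")
parts = split_on_any(mixed, ["い".encode("utf-8"), b"i"])
```

Because each delimiter is a separate alternative rather than a member of a `[..]` class, the three bytes of `い` are only matched together, never individually.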
