Given a string like this:
顺便采买些喜欢的CD和DVD或vcd。
The desired output is:
顺便采买些喜欢的 CD 和 DVD 或 vcd 。
I've tried looping through the characters, comparing each one's "ascii-ness" with that of the previous character, and using the following rules to decide whether to pad a space:
- Check the "ascii-ness" of the current character
- If the previous character's "ascii-ness" is NOT the same as the current's, left-pad a space
This is how I've been doing it, but it looks inefficient:
def addSpace(text):
    currIsAscii = None; prevIsAscii = None; newsentence = ""
    for i in text:
        # Python 2 only: decoding fails for non-ASCII characters
        try:
            i.decode('ascii')
            currIsAscii = True
        except:
            currIsAscii = False
        # Pad a space whenever the ascii-ness changes
        if prevIsAscii != currIsAscii:
            newsentence+=" "
            newsentence+=i
        else:
            newsentence+=i
        prevIsAscii = currIsAscii
    # Collapse runs of spaces into a single space
    while "  " in newsentence:
        newsentence = newsentence.replace("  ", " ")
    return newsentence.strip()
This code works in Python 2, but the i.decode('ascii') part is not compatible with both Python 2 and Python 3 (in Python 3, str has no decode method). I've seen "How to check if a string in Python is in ASCII?", but it has no solution that works on both Python 2 AND 3.
Is there a way to check the ascii-ness of a character that works on both Python 2 and 3?
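To make the question concrete, here is a minimal sketch of the kind of check I have in mind. It assumes the input is a unicode string (u"..." in Python 2, str in Python 3), so that iterating yields whole characters rather than bytes; the name is_ascii_char is just a placeholder of mine.

```python
def is_ascii_char(ch):
    """Return True if the single character `ch` lies in the ASCII range.

    ord() accepts a one-character string in both Python 2 and Python 3,
    and ASCII covers code points 0 through 127.
    """
    return ord(ch) < 128
```

This avoids decode() entirely, but it still has to be called once per character.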
Other than looping through each character, is there another way to pad a space at the start and end of an ascii substring?
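One loop-free direction might be a regular expression that matches whole runs of ASCII characters. The following is only a sketch under my own assumptions: the input is a unicode string, "ascii" means printable non-space ASCII (! through ~), and the names ASCII_RUN and pad_ascii_runs are made up for illustration.

```python
# -*- coding: utf-8 -*-
import re

# A run of printable, non-space ASCII characters ("!" is 0x21, "~" is 0x7E).
ASCII_RUN = re.compile(u"[!-~]+")

def pad_ascii_runs(text):
    """Surround each ASCII run with spaces, then normalise whitespace."""
    padded = ASCII_RUN.sub(lambda m: u" " + m.group(0) + u" ", text)
    # split() with no argument drops leading/trailing whitespace and
    # collapses repeated spaces, replacing the while/replace loop.
    return u" ".join(padded.split())

print(pad_ascii_runs(u"顺便采买些喜欢的CD和DVD或vcd。"))
# 顺便采买些喜欢的 CD 和 DVD 或 vcd 。
```

Like the original loop, this treats only pure ASCII as one unit, so it would still split a word such as "Café".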
Another quirk of the code above is that it doesn't handle code points beyond [a-zA-Z0-9], e.g. the word "Café。" becomes "Caf é。", whereas the desired output would be "Café 。".
Try this sentence:
s= u"顺便采买些喜欢的CD和DVD或Café。"
(For some reason I can't post the desired output since SO thinks it's spam, so I'll just describe it verbally.) The whole substring "Café" should be padded as one unit, not split into 2 substrings.
The detection of the substring needs to include accented Latin characters.
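To illustrate the behaviour I'm after, here is a rough sketch that widens the matched run to include accented Latin letters. The range U+00C0 to U+024F (Latin-1 Supplement letters through Latin Extended-B) is my own guess at what "accented Latin" should cover; it also happens to include the × and ÷ signs, and it assumes precomposed characters (é as U+00E9, not e followed by a combining accent). The names LATIN_RUN and pad_latin_runs are hypothetical.

```python
# -*- coding: utf-8 -*-
import re

# Printable non-space ASCII plus accented Latin letters (assumed range).
LATIN_RUN = re.compile(u"[!-~\u00c0-\u024f]+")

def pad_latin_runs(text):
    """Pad each ASCII/accented-Latin run with spaces as a single unit."""
    padded = LATIN_RUN.sub(lambda m: u" " + m.group(0) + u" ", text)
    # Collapse repeated spaces and trim both ends.
    return u" ".join(padded.split())

s = u"顺便采买些喜欢的CD和DVD或Caf\u00e9。"
print(pad_latin_runs(s))
# 顺便采买些喜欢的 CD 和 DVD 或 Café 。
```

Whether a fixed code point range is the right way to define "Latin" here is exactly what I'm unsure about.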