
Given a string like this:

顺便采买些喜欢的CD和DVD或vcd。

The desired output is:

顺便采买些喜欢的 CD 和 DVD 或 vcd 。

I've tried looping through each character and comparing its ASCII-ness with that of the character before it, using the following conditions to decide whether to pad a space:

  • Check the "ascii-ness" of the current character
  • If the previous character's "ascii-ness" is NOT the same as the current's, left-pad a space

This is how I'm doing it at the moment, and it looks inefficient:

def addSpace(text):
  currIsAscii = None; prevIsAscii = None; newsentence = ""
  for i in text:
    try:
      i.decode('ascii')
      currIsAscii = True
    except:
      currIsAscii = False
    if prevIsAscii != currIsAscii:
      newsentence+=" "
      newsentence+=i
    else:
      newsentence+=i
    prevIsAscii = currIsAscii
    while "  " in newsentence:
      newsentence = newsentence.replace("  ", " ")
  return newsentence.strip()

This code works in Python 2, but the i.decode('ascii') part is not compatible with both Python 2 and Python 3. I've seen "How to check if a string in Python is in ASCII?", but it has no solution that works for both Python 2 AND 3.

Is there a way to check the ASCII-ness of a character that works on both Python 2 and 3?

Other than looping through each character, is there another way to pad a space at the start and end of an ASCII substring?


Another quirk of the code above is that it doesn't handle characters beyond [a-zA-Z0-9]. For example, "Café。" becomes "Caf é。", whereas the desired output is "Café 。".

Try this sentence:

s= u"顺便采买些喜欢的CD和DVD或Café。"

(For some reason I can't post the desired output because SO thinks it's spam, so I'll describe it in words: the whole substring "Café" should be padded, not split into two substrings.)

The detection of the substring needs to include accented Latin characters.

alvas

2 Answers


In Python 3, a character class covering ASCII letters, digits, and the accented Latin-1 letters does the job:

import re
s= "顺便采买些喜欢的CD和DVD或Café。"
re.sub("([A-Za-z0-9À-Öà-ÿ]+)"," \\1 ",s)

[out]:

顺便采买些喜欢的 CD 和 DVD 或 Café 。
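
One thing to watch for: when a Latin run sits at the very start or end of the string, or next to an existing space, the substitution above leaves leading, trailing, or doubled spaces. A minimal sketch of a wrapper that tidies this up is shown here. The names pad_latin and LATIN_RUN are made up for illustration, and the character class is written with \u escapes and a u"" prefix on the assumption that this keeps the same pattern usable on both Python 2 and Python 3.

```python
import re

# Same ranges as the pattern above: A-Z, a-z, 0-9,
# \u00C0-\u00D6 (À-Ö) and \u00E0-\u00FF (à-ÿ).
# LATIN_RUN is a hypothetical name chosen for this sketch.
LATIN_RUN = re.compile(u"([A-Za-z0-9\u00C0-\u00D6\u00E0-\u00FF]+)")


def pad_latin(text):
    """Surround each Latin/digit run with single spaces (illustrative helper)."""
    padded = LATIN_RUN.sub(r" \1 ", text)
    # Collapse the doubled spaces created next to existing spaces,
    # then drop any space added at the start or end of the string.
    return re.sub(u" {2,}", u" ", padded).strip()
```

For example, pad_latin(u"CD和DVD") gives "CD 和 DVD" rather than " CD 和 DVD ".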

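Alternatively, use the third-party regex module (https://pypi.python.org/pypi/regex), which supports Unicode script properties such as \p{Latin}:

pip install regex

import regex
regex.sub(r"(\p{Latin}+)", r" \1 ", s)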
python必须死

You can use split() from the re module to split the string wherever a run of letters in the range a-z or A-Z is found (the capturing group keeps the matched runs in the result), then join all the pieces with a space to get the desired result:

import re
s = u"顺便采买些喜欢的CD和DVD或vcd"
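# print(...) with a single argument works on both Python 2 and 3.
# re.split leaves an empty piece when the text ends with a match, hence strip().
print(" ".join(re.split(r"([a-zA-Z]+)", s)).strip())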
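
The same split-and-join idea can also be done without a regex, which touches on the Python 2 and 3 compatibility part of the question. The sketch below assumes the input is a unicode string and tests ASCII-ness with ord(ch) < 128, which behaves the same on both versions. The helper names is_ascii and pad_ascii_runs are hypothetical.

```python
from itertools import groupby


def is_ascii(ch):
    # ord() accepts a one-character unicode string on Python 2 and 3 alike.
    return ord(ch) < 128


def pad_ascii_runs(text):
    """Join maximal ASCII / non-ASCII runs with single spaces (illustrative)."""
    # groupby yields consecutive characters that share the same ASCII-ness.
    runs = (u"".join(group) for _, group in groupby(text, key=is_ascii))
    # Strip each run and skip whitespace-only runs to avoid doubled spaces.
    return u" ".join(run.strip() for run in runs if run.strip())
```

Because the test is strictly ASCII, this still splits "Café" into "Caf é", so it only suits the unaccented case. Accented letters need one of the regex patterns from the other answer.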
ZdaR