How to check whether a substring consists of ASCII/Latin characters and pad that substring with spaces?

Question

Given a string like this:

顺便采买些喜欢的CD和DVD或vcd。

The desired output is:

顺便采买些喜欢的 CD 和 DVD 或 vcd 。

I've tried looping through each character and checking whether the characters before and after it are ASCII, then using the following conditions to decide whether to pad a space:

Check the "ascii-ness" of the current character
If the previous character's "ascii-ness" is NOT the same as the current's, left-pad a space

This is how I've been doing it, and it looks inefficient:

def addSpace(text):
  currIsAscii = None; prevIsAscii = None; newsentence = ""
  for i in text:
    try:
      i.decode('ascii')
      currIsAscii = True
    except:
      currIsAscii = False
    if prevIsAscii != currIsAscii:
      newsentence+=" "
      newsentence+=i
    else:
      newsentence+=i
    prevIsAscii = currIsAscii
    while "  " in newsentence:
      newsentence = newsentence.replace("  ", " ")
  return newsentence.strip()

This code works in Python 2, but the i.decode('ascii') call is not compatible with both Python 2 and Python 3. I've seen "How to check if a string in Python is in ASCII?", but it offers no solution that works on both Python 2 AND 3.

Is there a way to check the ASCII-ness of a character that works on both Python 2 and 3?

Other than looping through each character, is there another way to pad spaces at the start and end of an ASCII substring?

Another quirk with the code above is that it doesn't handle code points beyond [a-zA-Z0-9]. For example, "Café。" becomes "Caf é。", whereas the desired output is "Café 。".

Try this sentence:

s= u"顺便采买些喜欢的CD和DVD或Café。"

(For some reason I can't post the desired output since SO thinks it's spam, so I'll just describe it in words.) The whole substring "Café" should be padded, not split into 2 substrings.

The detection of the substring needs to include accented latin characters.

Shouldn't you try to _encode_ to ascii? Is text a unicode string? — RemcoGerlich, Jan 25 '17 at 09:05
It's usually a unicode string that contains ascii substrings. And knowing where it's ascii and the start and end points of the offset is needed to pad the space. — alvas, Jan 25 '17 at 09:07
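Following the hint in the comment above about encoding rather than decoding, here is a minimal sketch of an ASCII check that should behave the same on Python 2 (for unicode objects) and Python 3 (for str). The name is_ascii is hypothetical and not from the thread. Note that it treats "é" as non-ASCII, so on its own it does not solve the "Café" case.

```python
# -*- coding: utf-8 -*-
# Sketch only: `is_ascii` is an illustrative helper name, not from the thread.


def is_ascii(text):
    """Return True if the unicode string `text` contains only ASCII code points."""
    try:
        # Encoding unicode text to ASCII fails on any code point above 127.
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


print(is_ascii(u"C"))   # True
print(is_ascii(u"é"))   # False
print(is_ascii(u"顺"))  # False
```

For a single character, comparing ord(ch) < 128 gives the same answer without the try/except.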

python必须死 · Accepted Answer · 2017-01-27T16:17:06.793

4

In Python 3:

import re
s = "顺便采买些喜欢的CD和DVD或Café。"
print(re.sub(r"([A-Za-z0-9À-Öà-ÿ]+)", r" \1 ", s))

[out]:

顺便采买些喜欢的 CD 和 DVD 或 Café 。
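The substitution above can be wrapped in a small helper that also trims the result, as the code in the question does. This is only a sketch: pad_latin and LATIN_RUN are hypothetical names, the character ranges are the same ones used above written as \u escapes, and the u"" prefixes are there so the same file can run on Python 2 and Python 3.3+.

```python
# -*- coding: utf-8 -*-
import re

# Same ranges as in the answer: À-Ö is \u00C0-\u00D6 and à-ÿ is \u00E0-\u00FF.
# `LATIN_RUN` and `pad_latin` are illustrative names, not from the thread.
LATIN_RUN = re.compile(u"[A-Za-z0-9\u00C0-\u00D6\u00E0-\u00FF]+")


def pad_latin(text):
    """Surround each run of Latin letters/digits with single spaces."""
    padded = LATIN_RUN.sub(lambda m: u" " + m.group(0) + u" ", text)
    # Collapse doubled spaces and trim the ends, as the question's code does.
    return re.sub(u" {2,}", u" ", padded).strip()


print(pad_latin(u"顺便采买些喜欢的CD和DVD或Café。"))
```

Compiling the pattern once avoids rebuilding it when the helper is called on many sentences.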

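Alternatively, use the third-party regex module (https://pypi.python.org/pypi/regex), which supports Unicode script properties such as \p{Latin}. Note that \p{Latin} matches letters only, not digits.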

pip install regex

import regex
print(regex.sub(r"(\p{Latin}+)", r" \1 ", s))
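If installing the regex module is not an option, a rough stand-in for \p{Latin} can be sketched with the standard library only. This is a different technique from the answer's: it tests whether a character's Unicode name starts with "LATIN" and groups consecutive characters with itertools.groupby. The names is_latin and pad_latin_runs are hypothetical, and the name test is only an approximation of the Latin script property (like \p{Latin}, it does not match digits, and it assumes precomposed accented letters).

```python
# -*- coding: utf-8 -*-
import unicodedata
from itertools import groupby


def is_latin(ch):
    """Approximate check: the character's Unicode name starts with 'LATIN'."""
    # The second argument is the default returned for unnamed code points.
    return unicodedata.name(ch, u"").startswith(u"LATIN")


def pad_latin_runs(text):
    """Group consecutive characters by is_latin() and join the groups with spaces."""
    groups = (u"".join(run) for _, run in groupby(text, key=is_latin))
    # split()/join collapses any repeated whitespace and trims the ends.
    return u" ".join(u" ".join(groups).split())


print(pad_latin_runs(u"顺便采买些喜欢的CD和DVD或Café。"))
```

This still visits every character, so it is closer in spirit to the loop in the question than to the regular-expression approach.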

edited Jan 27 '17 at 16:17

answered Jan 25 '17 at 09:14


Thanks! Stackoverflow doesn't use the triple ``` flavored markdown =) – alvas Jan 25 '17 at 09:38

score 1 · Answer 2 · answered Jan 25 '17 at 09:20

You can use split() from the re module to split the string wherever a run of letters in the range a-z or A-Z is found, and then join all the split elements with a space to get the desired result:

import re
s = u"顺便采买些喜欢的CD和DVD或vcd"
print(" ".join(re.split(r"([a-zA-Z]+)", s)))
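One detail of the split approach: when the string ends with a matched run, re.split returns an empty string as the last element, which leaves a trailing space after the join. A minimal sketch that drops empty pieces before joining (the variable names parts and result are illustrative):

```python
# -*- coding: utf-8 -*-
import re

s = u"顺便采买些喜欢的CD和DVD或vcd"
# The capturing group keeps the matched runs in the result list.
parts = re.split(r"([a-zA-Z]+)", s)
# A match at the very end leaves an empty string as the last item; drop it.
result = u" ".join(p for p in parts if p)
print(result)
```

As written, the pattern covers only a-z and A-Z, so digits and accented letters such as "é" would still be treated as separators.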
