
I have a string with a bunch of non-ASCII characters and I would like to remove them. I used the following function in Python 3:

def removeNonAscii(s): 
    return "".join(filter(lambda x: ord(x)<128, s))

str1 = "Hi there!\xc2\xa0My\xc2\xa0name\xc2\xa0is\xc2\xa0Blue "
new = removeNonAscii(str1)

The new string becomes:

Hi there!MynameisBlue

Is it possible to keep spaces between the words, so that the result is:

Hi there! My name is Blue

jamylak
lost9123193
  • [`def removeNonAscii(s): return "".join(filter(lambda x: ord(x)<128, s))`](http://stackoverflow.com/questions/1342000/how-to-replace-non-ascii-characters-in-string) and [here](http://stackoverflow.com/questions/8689795/python-remove-non-ascii-characters-but-leave-periods-and-spaces) is one more helpful Q&A – Grijesh Chauhan May 26 '13 at 05:37
  • @GrijeshChauhan: It is the same piece of code that OP has! – nhahtdh May 26 '13 at 05:39
  • @GrijeshChauhan that's what I used, but I still have the same problem as mentioned above – lost9123193 May 26 '13 at 05:41
  • for reference, the correct way to do the original task (without adding spaces) is `new=str1.encode('ascii','ignore')`, using the 'errors' argument to `encode()`. – kampu May 26 '13 at 07:01
  • @nhahtdh my mistake I commented based on question title :( – Grijesh Chauhan May 26 '13 at 12:36
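
A note on the `encode()` suggestion in the comments above: in Python 3, `str.encode()` returns a `bytes` object, so a `decode()` step is needed to get a `str` back. The following is a minimal sketch of that variant, not the commenter's exact code; the name `ascii_bytes` is only illustrative. It drops the non-ASCII characters without inserting spaces, as the comment states.

```python
str1 = "Hi there!\xc2\xa0My\xc2\xa0name\xc2\xa0is\xc2\xa0Blue "

# errors="ignore" silently drops every character that cannot be encoded as ASCII.
ascii_bytes = str1.encode("ascii", "ignore")

# In Python 3 the result above is bytes, so decode it to get a str again.
new = ascii_bytes.decode("ascii")

print(repr(new))
```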

2 Answers


The code below is equivalent to your current code, except that for a contiguous sequence of characters outside the range of US-ASCII, it will replace the whole sequence with a single space (ASCII 32).

import re
re.sub(r'[^\x00-\x7f]+', " ", inputString)

Note that both the code above and the code in the question keep ASCII control characters (code points 0 to 31, and 127), since they fall inside the US-ASCII range.
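
For completeness, here is a small self-contained sketch of the regex approach applied to the string from the question. The names `NON_ASCII_RUN` and `replace_non_ascii` are made up for this example and are not part of any library.

```python
import re

# Matches one or more consecutive characters outside the US-ASCII range.
NON_ASCII_RUN = re.compile(r"[^\x00-\x7f]+")

def replace_non_ascii(s, replacement=" "):
    """Replace each run of non-ASCII characters with a single replacement string."""
    return NON_ASCII_RUN.sub(replacement, s)

str1 = "Hi there!\xc2\xa0My\xc2\xa0name\xc2\xa0is\xc2\xa0Blue "
new = replace_non_ascii(str1)
print(repr(new))  # 'Hi there! My name is Blue '
```

Because the pattern uses `+`, each `\xc2\xa0` pair in the sample string (two non-ASCII characters in a Python 3 `str`) collapses into one space rather than two.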

nhahtdh

Regex wins here, but FWIW, here is an `itertools.groupby` solution:

from itertools import groupby
text = "Hi there!\xc2\xa0My\xc2\xa0name\xc2\xa0is\xc2\xa0Blue "
def valid(c):
    return ord(c) < 128

def removeNonAscii(s):
    return ''.join(''.join(g) if k else ' ' for k, g in groupby(s, valid))

>>> removeNonAscii(text)
'Hi there! My name is Blue '
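
To see why this works, it may help to look at what `groupby` produces for a shorter string: consecutive characters with the same key are grouped together, so each run of non-ASCII characters becomes one group that can be swapped for a single space. This is only an illustrative sketch; `sample` and `groups` are names chosen for the example.

```python
from itertools import groupby

sample = "Hi\xc2\xa0there"

# Pair each key (True = ASCII, False = non-ASCII) with the run of characters it covers.
groups = [(is_ascii, "".join(run))
          for is_ascii, run in groupby(sample, lambda c: ord(c) < 128)]

print(groups)  # [(True, 'Hi'), (False, '\xc2\xa0'), (True, 'there')]
```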
jamylak