How can I remove non-ASCII characters but leave periods and spaces?

Question

I'm working with a .txt file. I want a string of the text from the file with no non-ASCII characters. However, I want to leave spaces and periods. At present, I'm stripping those too. Here's the code:

def onlyascii(char):
    if ord(char) < 48 or ord(char) > 127: return ''
    else: return char

def get_my_string(file_path):
    f=open(file_path,'r')
    data=f.read()
    f.close()
    filtered_data=filter(onlyascii, data)
    filtered_data = filtered_data.lower()
    return filtered_data

How should I modify onlyascii() to leave spaces and periods? I imagine it's not too complicated but I can't figure it out.

Thanks (sincerely) for the clarification John. I understood that spaces and periods are ASCII characters. However, I was removing both of them unintentionally while trying to remove only non-ASCII characters. I see how my question might have implied otherwise. — , Dec 31 '11 at 21:38
@PoliticalEconomist: Your problem is still very under-specified. See my answer. — John Machin, Dec 31 '11 at 22:05

score 225 · Accepted Answer · edited Oct 04 '19 at 11:29

You can filter all characters from the string that are not printable using string.printable, like this:

>>> s = "some\x00string. with\x15 funny characters"
>>> import string
>>> printable = set(string.printable)
>>> filter(lambda x: x in printable, s)
'somestring. with funny characters'

string.printable on my machine contains:

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c

EDIT: On Python 3, filter returns an iterator rather than a string. To get a string back, join the result:

''.join(filter(lambda x: x in printable, s))
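
For reference, a minimal Python 3 sketch of the approach above wrapped in a helper. The name `keep_printable` is an assumption made for illustration and is not part of the original answer.

```python
import string

# Build the lookup set once; set membership tests are O(1) on average.
PRINTABLE = set(string.printable)

def keep_printable(text):
    """Return text with every character outside string.printable removed."""
    return ''.join(ch for ch in text if ch in PRINTABLE)

s = "some\x00string. with\x15 funny characters"
print(keep_printable(s))  # somestring. with funny characters
```

Spaces and periods survive because both are members of string.printable.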

edited Oct 04 '19 at 11:29

robertspierre

answered Dec 31 '11 at 18:29

jterrace

`chr(127) in string.printable` ? – joaquin Dec 31 '11 at 18:39
what's up with those printable chars that are below ordinal 48 ? – joaquin Dec 31 '11 at 18:46
chr(127) in string.printable == False – jterrace Dec 31 '11 at 18:48
Do you mean 0b and 0c? They are part of string.whitespace. – jterrace Dec 31 '11 at 18:49
yes, and from the OP: `if ord(char) < 48 or ord(char) > 127`. About my second comment, I am referring to '*', '(', and other printable characters that are eliminated by the OP... – joaquin Dec 31 '11 at 18:54
Yeah, I was extrapolating that the OP probably meant all printable characters, rather than what was actually said, but might not be the case. – jterrace Dec 31 '11 at 18:57
Thanks! I understand now. Sorry for the confusion - jterrace correctly interpreted my question. – Dec 31 '11 at 20:56
this is also great for just filtering to digits - filter(lambda x: x in string.digits, s) – rickcnagy Oct 08 '13 at 15:35
This is incredibly slow in a large file. Any suggestions? – Xodarap777 Jan 12 '14 at 22:34
@Xodarap777 create a `set(string.printable)` and re-use it for the filtering. Also don't filter the whole file at once - do it in chunks of 8K-512K – jterrace Jan 13 '14 at 00:05
The only problem with using `filter` is that it returns an iterable. If you need a string back (as I did because I needed this when doing list compression) then do this: `''.join(filter(lambda x: x in string.printable, s))`. – cjbarth Sep 05 '14 at 19:23
@cjbarth - comment is python 3 specific, but very useful. Thanks! – undershock Jan 13 '15 at 15:13
Why not use regular expression: `re.sub(r'[^\x00-\x7f]',r'', your-non-ascii-string)` . See this thread http://stackoverflow.com/a/20079244/658497 – Noam Manos Jan 18 '16 at 16:08
This is the most compatible way of doing the OP's task, I tested in from Python 2.6 to Python 3.5. – gaborous Jan 30 '16 at 16:31
@NoamManos this was 4-5 times faster for me than the join...filter...lambda solution, thanks. – artfulrobot Feb 22 '16 at 11:59
I suspect changing `lambda x: x in printable` to `printable.__contains__` would make it run faster; the `lambda` means more Python level code execution, while directly passing the built-in membership test method removes per character byte code execution. – ShadowRanger Apr 04 '16 at 22:52
[PyLint Complains](http://pylint-messages.wikidot.com/messages:w0141) on the use of `filter` when using the above code. Given that [list comprehensions seem to be preferred](http://stackoverflow.com/a/3013686/1448678) would using `''.join(x for x in s if x in printable)` be a) equivalent, and b) any better? – Jonny Jun 17 '16 at 14:37
Edit: I realise the above is a generator expression, but does the same apply? – Jonny Jun 18 '16 at 14:44
@Jonny - it's most likely equivalent, but I'd have to profile it to know for sure – jterrace Jun 19 '16 at 17:02
@Jonny, the result is the same, but the time differs (you need to compare the two if it happens to be a bottleneck). The generator expression is easier on the eye: the fewer different tools in use, the faster the code is to read. You may want to add an [Enter] before `if` and indent the second line so `if` starts just after `(` from the first line. – Ctrl-C Jun 06 '18 at 11:34
Am I the only one who this doesn't work for? Why wouldn't those characters be included in the printable list? like `0` or `x` for example? – Amon Jan 27 '20 at 21:36
@CharlesSmith - those are escape sequences – jterrace Jan 28 '20 at 20:32
When assigning the value to a variable it works fine, whereas when reading from a file the filtering has no effect. I don't know why. Any ideas? – Brajesh Jun 04 '20 at 22:05
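
The performance suggestions in the comments above (reuse one set, process the input in chunks, pass `printable.__contains__` instead of a lambda) can be combined roughly as follows. This is a sketch under assumptions: the function name `filter_stream` and the 64 KiB default chunk size are illustrative choices, and the in-memory `io.StringIO` objects stand in for real files opened in text mode.

```python
import io
import string

PRINTABLE = set(string.printable)

def filter_stream(src, dst, chunk_size=64 * 1024):
    """Copy text from src to dst in chunks, keeping only string.printable characters."""
    keep = PRINTABLE.__contains__  # bound method: no lambda call per character
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(''.join(filter(keep, chunk)))

# In-memory demo; a small chunk size is used only to exercise the loop.
src = io.StringIO("caf\xe9 au lait.\x00\n" * 3)
dst = io.StringIO()
filter_stream(src, dst, chunk_size=8)
print(repr(dst.getvalue()))
```

Whether this is actually faster than the regex approach depends on the data, so it is worth timing both on a representative file.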

Zweedeend · Answer 2 · 2017-07-26T10:10:19.777

114

An easy way to change to a different codec is to use encode() or decode(). In your case, you want to convert to ASCII and ignore all symbols that are not supported. For example, the Swedish letter å is not an ASCII character:

    >>> s = u'Good bye in Swedish is Hej d\xe5'
    >>> s = s.encode('ascii', errors='ignore')
    >>> print s
    Good bye in Swedish is Hej d

Edit:

Python3: str -> bytes -> str

>>> "Hej då".encode("ascii", errors="ignore").decode()
'Hej d'

Python2: unicode -> str -> unicode

>>> u"hej då".encode("ascii", errors="ignore").decode()
u'hej d'

Python2: str -> unicode -> str (decode and encode in reverse order)

>>> "hej d\xe5".decode("ascii", errors="ignore").encode()
'hej d'
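
A small Python 3 sketch of the same round trip as a reusable function. The name `to_ascii` is hypothetical. Note that this keeps every ASCII character, including control characters such as `\x00`, so it removes non-ASCII text but not unprintable ASCII.

```python
def to_ascii(text):
    """Drop every character that cannot be encoded as ASCII (str in, str out)."""
    return text.encode("ascii", errors="ignore").decode("ascii")

print(to_ascii("Hej d\xe5. Vi ses!"))  # Hej d. Vi ses!
```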

edited Jul 26 '17 at 10:10

answered Aug 25 '13 at 15:50

Zweedeend

I get `UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 27` – Xodarap777 Jan 12 '14 at 22:33
I got that error when I put the actual unicode character in the string via copy paste. When you specify a string as u'thestring' encode works correctly. – Ben Liyanage Apr 30 '15 at 21:05
Works only on Py3, but it's elegant. – gaborous Jan 30 '16 at 16:32
For those who are getting the same error as @Xodarap777 : you should first .decode() the string, and only after that encode. For example `s.decode('utf-8').encode('ascii', errors='ignore')` – Spc_555 Mar 21 '17 at 17:40
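
On Python 3, the decode step mentioned in the comments above usually happens when the file is opened. The sketch below assumes the file is UTF-8 encoded; the helper name `read_ascii_only` and the temporary demo file are illustrative only.

```python
import os
import tempfile

def read_ascii_only(path, encoding="utf-8"):
    """Decode the file with its real encoding, then drop non-ASCII characters."""
    with open(path, "r", encoding=encoding) as f:
        text = f.read()
    return text.encode("ascii", errors="ignore").decode("ascii")

# Demo with a temporary UTF-8 file that is deleted afterwards.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt",
                                 delete=False) as tmp:
    tmp.write("Hej d\xe5. Sm\xf6rg\xe5sbord.\n")
try:
    print(repr(read_ascii_only(tmp.name)))  # 'Hej d. Smrgsbord.\n'
finally:
    os.remove(tmp.name)
```

If the real encoding is unknown, passing the wrong one can raise UnicodeDecodeError, which matches the error reported in the comments.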

Noam Manos · Answer 3 · 2020-11-16T17:17:37.410

41

According to @artfulrobot, this should be faster than filter and lambda:

import re
re.sub(r'[^\x00-\x7f]', r'', your_non_ascii_string)

See more examples in the related question "Replace non-ASCII characters with a single space".

edited Nov 16 '20 at 17:17

answered Feb 23 '16 at 14:14

Noam Manos

This solution answers OP's stated question, but beware that it won't remove non printable characters that are included in ASCII which I think is what OP intended to ask. – Danilo Souza Morães Jun 15 '18 at 00:32
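
If unprintable ASCII characters should go as well, the character class can be narrowed. The sketch below assumes you want to keep printable ASCII (space through `~`) plus newline and tab; that choice, and the names `NON_PRINTABLE_ASCII` and `strip_non_printable`, are assumptions rather than part of the original answer.

```python
import re

# Matches anything that is NOT printable ASCII (space through ~),
# a newline, or a tab.
NON_PRINTABLE_ASCII = re.compile(r'[^\x20-\x7e\n\t]')

def strip_non_printable(text):
    """Remove non-ASCII characters and ASCII control characters."""
    return NON_PRINTABLE_ASCII.sub('', text)

print(repr(strip_non_printable("bell\x07 and d\xe5. ok\n")))  # 'bell and d. ok\n'
```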

Noha Elprince · Answer 4 · 2019-07-31T22:10:05.743

8

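You may use the following code to remove non-ASCII characters, such as non-English letters:

import re
# Named `text` rather than `str` to avoid shadowing the built-in type.
text = "123456790 ABC#%? .(朱惠英)"
result = re.sub(r'[^\x00-\x7f]', r'', text)
print(result)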

This will return

123456790 ABC#%? .()

edited Jul 31 '19 at 22:10

answered Jul 30 '19 at 22:27

Noha Elprince

Could you explain more on the regex you used? r'[^\x00-\x7f]' – Reihan_amn Nov 14 '22 at 23:41
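
In reply to the question above, the pattern can be read piece by piece: `[...]` is a character class, a leading `^` negates it, and `\x00-\x7f` is the range of code points 0 through 127, which is the whole ASCII range. So `[^\x00-\x7f]` matches any single character that is not ASCII. A short sketch with a made-up sample string:

```python
import re

pattern = re.compile(r'[^\x00-\x7f]')

sample = "A.(朱)\xe9"
# findall lists the characters the pattern matches, i.e. what re.sub removes.
print(pattern.findall(sample))  # ['朱', 'é']
print(pattern.sub('', sample))  # A.()
```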

score 6 · Answer 5 · edited Apr 04 '16 at 22:28

Your question is ambiguous; the first two sentences taken together imply that you believe that space and "period" are non-ASCII characters. This is incorrect. All chars such that ord(char) <= 127 are ASCII characters. For example, your function excludes these characters !"#$%&'()*+,-./ (and space) but includes several others, e.g. []{}.

Please step back, think a bit, and edit your question to tell us what you are trying to do, without mentioning the word ASCII, and why you think that chars such that ord(char) >= 128 are ignorable. Also: which version of Python? What is the encoding of your input data?

Please note that your code reads the whole input file as a single string, and your comment ("great solution") to another answer implies that you don't care about newlines in your data. If your file contains two lines like this:

this is line 1
this is line 2

the result would be 'this is line 1this is line 2' ... is that what you really want?

A better solution would include:

a better name for the filter function than onlyascii

recognition that a filter function merely needs to return a truthy value if the argument is to be retained:

def filter_func(char):
    return char == '\n' or 32 <= ord(char) <= 126
# and later:
filtered_data = filter(filter_func, data).lower()
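
The snippet above relies on Python 2, where filter() on a string returns a string. A Python 3 sketch of the same idea follows; the name `is_wanted` and the sample data are illustrative assumptions.

```python
def is_wanted(char):
    """True for newline and printable ASCII (space through ~)."""
    return char == '\n' or 32 <= ord(char) <= 126

data = "This is Line 1.\nCaf\xe9 \x07Line 2.\n"
# In Python 3, filter() returns an iterator, so join it back into a str.
filtered_data = ''.join(filter(is_wanted, data)).lower()
print(repr(filtered_data))  # 'this is line 1.\ncaf line 2.\n'
```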

This answer is very helpful to those of us coming in to ask something similar to the OP, and your proposed answer is helpfully pythonic. I do, however, find it strange that there isn't a more efficient solution to the problem as you interpreted it (which I often run into) - character by character, this takes a very long time in a very large file. — Xodarap777, Jan 12 '14 at 22:50

Matthew Dunn · Answer 6 · 2017-09-14T19:14:52.207

2

Working my way through Fluent Python (Ramalho) - highly recommended. List comprehension one-ish-liners inspired by Chapter 2:

onlyascii = ''.join([s for s in data if ord(s) < 128])  # ASCII covers code points 0-127
onlymatch = ''.join([s for s in data if s in
              'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'])
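
A possible variation on the two one-liners, using standard-library helpers instead of a hand-typed alphabet. It assumes Python 3.7 or later for `str.isascii()`; the variable names and sample string are made up for illustration.

```python
import string

data = "Na\xefve caf\xe9, 100%!"

# str.isascii() exists on Python 3.7+; on older versions use ord(ch) < 128.
only_ascii = ''.join(ch for ch in data if ch.isascii())

# string.ascii_letters is 'abc...xyzABC...XYZ'; a set makes lookups cheap.
letters = set(string.ascii_letters)
only_letters = ''.join(ch for ch in data if ch in letters)

print(only_ascii)    # Nave caf, 100%!
print(only_letters)  # Navecaf
```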

edited Sep 14 '17 at 19:14

answered Sep 14 '17 at 18:27

Matthew Dunn

This would not allow for standard ASCII symbols, such as bullet points, degrees symbol, copyright symbol, Yen symbol, etc. Also, your first example includes non-printable symbols, such as BELL, which is undesirable. – SherylHohman Apr 13 '20 at 05:35

joaquin · Answer 7 · 2011-12-31T19:11:46.117

1

If you want printable ascii characters you probably should correct your code to:

if ord(char) < 32 or ord(char) > 126: return ''

This is equivalent to string.printable (see the answer from @jterrace), except that it excludes the whitespace control characters ('\t', '\n', '\x0b', '\x0c' and '\r'). Note that it doesn't correspond to the range in your question.
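
Put together with the question's function, the corrected range might look like the sketch below on Python 3. The helper name `clean` is an assumption; `onlyascii` keeps the name used in the question.

```python
def onlyascii(char):
    """Keep printable ASCII only: space (32) through tilde (126)."""
    return char if 32 <= ord(char) <= 126 else ''

def clean(data):
    # map() applies onlyascii to every character; join rebuilds the string.
    return ''.join(map(onlyascii, data)).lower()

print(clean("Hello, W\xf6rld.\tBye."))  # hello, wrld.bye.
```

Spaces and periods are kept because their code points (32 and 46) fall inside the range, while tabs and newlines are dropped.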

edited Dec 31 '11 at 19:11

answered Dec 31 '11 at 18:50

joaquin

Slightly simpler: lambda x: 32 <= ord(x) <= 126 – jterrace Dec 31 '11 at 18:59
that's not the same as string.printable because it leaves out string.whitespace, although that might be what the OP wants, depends on things like \n and \t. – jterrace Dec 31 '11 at 19:02
@jterrace right, includes space (ord 32) but no returns and tabs – joaquin Dec 31 '11 at 19:07
yeah, just commenting on "this is equivalent to string.printable", but not true – jterrace Dec 31 '11 at 19:08
I edited the answer, thanks! the OP question is misleading if you do not read it carefully. – joaquin Dec 31 '11 at 19:12

Ahmed Sheri · Answer 8 · 2023-02-05T20:44:36.687

This function keeps only printable ASCII characters and accepts either str or bytes input, returning the same type it was given:

from string import printable

def getOnlyCharacters(texts):
    # Remember the input type so the same type can be returned.
    is_bytes = isinstance(texts, bytes)
    if is_bytes:
        # Decode as UTF-8, silently dropping invalid byte sequences.
        texts = texts.decode('utf-8', 'ignore')

    # Keep only the characters listed in string.printable.
    result = ''.join(ch for ch in str(texts) if ch in printable)

    # Return bytes for bytes input, str otherwise.
    return result.encode('utf-8') if is_bytes else result

text = '�Ahm�����ed Sheri��'
result = getOnlyCharacters(text)

print(result)
#input --> �Ahm�����ed Sheri��
#output --> Ahmed Sheri

