How to include non-ASCII characters in a regular expression in Python

Question

I have a text file that I'm reading line by line. If a line contains special characters, I remove them with a regular expression.

import re

fh = open(r"abc.txt", "r+")
data = fh.read()
#print re.sub(r'\W+', '', data)
# Replace everything except letters, digits, newlines and .;,?!$ with a space
new_str = re.sub('[^a-zA-Z0-9\n.;,?!$]', ' ', data)

So I keep only alphanumeric characters plus a few special symbols, namely [.;,?!$]. I also want to keep the euro sign (€), pound sign (£), Japanese yen sign (¥) and rupee sign (₹). These are not ASCII characters, so when I include them in my regular expression, as in re.sub('[^a-zA-Z0-9\n.;,?!$€₹¥]', ' ', data), I get an error message: SyntaxError: Non-ASCII character '\xe2' in file preprocess.py on line 23, but no encoding declared

https://stackoverflow.com/questions/393843/python-and-regular-expression-with-unicode looks like you're not encoding correctly — aydow, Feb 14 '18 at 05:56
Maybe relevant: https://stackoverflow.com/questions/3170211/why-declare-unicode-by-string-in-python — user2864740, Feb 14 '18 at 05:56
A workaround might be to specify a Unicode range instead ([see here](https://stackoverflow.com/questions/3835917/how-do-i-specify-a-range-of-unicode-characters)). But that might not be clean if the characters you want to spare don't fit neatly into a single range. — Tim Biegeleisen, Feb 14 '18 at 05:57
Please specify whether you are using Python 2 or 3. If Python 2, are you using the encoding line on top of the file? — Hubert Grzeskowiak, Feb 14 '18 at 05:58
I'm not using any encoding, i already shared the code snippet here. @TWrist — Mridul Sachan, Feb 14 '18 at 05:59
I'm using Python 2.7, and I'm not using any encoding line in top of the file. @HubertGrzeskowiak — Mridul Sachan, Feb 14 '18 at 06:00
Just be aware, as was mentioned, that in Python 3 all strings are Unicode. If you are working with raw bytes, you need to use the b'' prefix. — , Feb 14 '18 at 06:05

score 0 · Answer 1 · answered Feb 14 '18 at 05:57

You can use Unicode character escapes. For example, the euro sign above can be written as \u20ac. The four hexadecimal digits are the character's Unicode code point, which is the same regardless of encoding. In a regex, this might look like:

[^a-zA-Z0-9\u20ac]
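
A minimal sketch of how this escape could be applied to the full character set from the question. It assumes the text is already a Unicode string; the names KEEP_PATTERN, strip_special and sample are illustrative and not from the original post. The pattern is written as a u'' literal because Python 2 only interprets \u escapes in Unicode literals, and the same code also runs on Python 3.

```python
import re

# Keep letters, digits, newlines, basic punctuation and four currency
# signs written as escapes: euro (20ac), pound (00a3), yen (00a5), rupee (20b9).
KEEP_PATTERN = re.compile(u'[^a-zA-Z0-9\n.;,?!$\u20ac\u00a3\u00a5\u20b9]')


def strip_special(text):
    """Replace every character outside the allowed set with a space."""
    return KEEP_PATTERN.sub(u' ', text)


# Made-up sample text, used only for illustration.
sample = u'Price: \u20ac5 or \u00a34 (approx. \u20b9450) #sale'
cleaned = strip_special(sample)
print(cleaned)
```

Because each disallowed character is swapped for exactly one space, the cleaned string has the same length as the input.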

answered Feb 14 '18 at 05:57

entpnerd

I already tried it this way, but it does not work. – Mridul Sachan Feb 14 '18 at 06:02
When you say it "is not working", do you mean your regex doesn't match or you are still getting the `SyntaxError: Non-ASCII character '\xe2' in file preprocess.py on line 23`? – entpnerd Feb 14 '18 at 06:10
I'm not getting any error, but the special symbol is not kept. In the output, a symbol such as the euro sign is removed. – Mridul Sachan Feb 14 '18 at 06:15
Try using `re.sub('', ' ', data, 0, re.UNICODE)` – entpnerd Feb 14 '18 at 06:27
It's working in Python 3 BUT NOT WORKING IN PYTHON 2. – Mridul Sachan Feb 14 '18 at 08:24
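
One likely reason for the difference, offered as an assumption since the full script is not shown: in Python 2, fh.read() returns a byte string, so a Unicode pattern never sees the euro sign as a single character. A sketch that decodes the bytes first, assuming the file is UTF-8 encoded (the sample bytes and variable names are made up for illustration):

```python
import re

# UTF-8 bytes as Python 2's fh.read() would return them:
# "Cost", euro sign, "20 & ", pound sign, "15".
raw = b'Cost \xe2\x82\xac20 & \xc2\xa315'

# Decode to a Unicode string so each currency sign is one character.
text = raw.decode('utf-8')

pattern = re.compile(u'[^a-zA-Z0-9\n.;,?!$\u20ac\u00a3\u00a5\u20b9]')
cleaned_text = pattern.sub(u' ', text)

# Encode again only if bytes are needed, for example to write a file in Python 2.
encoded = cleaned_text.encode('utf-8')
```

This runs unchanged on Python 2.7 and Python 3, since both accept the b'' and u'' prefixes.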

score 0 · Answer 2 · answered Feb 14 '18 at 06:02

Maybe not the solution, but potentially a partial solution. Use this as the first two lines of each of your Python 2 files:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

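The second line declares the source file's encoding as UTF-8, which lets Python 2 accept non-ASCII characters in the source and removes the SyntaxError. Python 3 assumes UTF-8 source files by default.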
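
With that declaration in place, the symbols could be written literally in the pattern. The following is a sketch under a few assumptions: the file is UTF-8 encoded, and clean_file, ALLOWED and the temporary demo file are hypothetical names introduced here. It uses io.open so the text is decoded to Unicode on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import re
import tempfile

# The u prefix makes this a Unicode pattern on Python 2 as well.
ALLOWED = re.compile(u'[^a-zA-Z0-9\n.;,?!$€£¥₹]')


def clean_file(path):
    """Read a UTF-8 text file and blank out disallowed characters."""
    # io.open with an encoding returns Unicode text, not bytes.
    with io.open(path, 'r', encoding='utf-8') as fh:
        return ALLOWED.sub(u' ', fh.read())


# Hypothetical demo file standing in for abc.txt.
demo_path = os.path.join(tempfile.mkdtemp(), 'abc.txt')
with io.open(demo_path, 'w', encoding='utf-8') as fh:
    fh.write(u'Total: €10 + £5 @ shop\n')

result = clean_file(demo_path)
print(result)
```

The encoding comment only affects how Python reads the source file; decoding the data file is a separate step, which is why the file is opened with an explicit encoding.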

answered Feb 14 '18 at 06:02

Hubert Grzeskowiak