3

I have a text file that I'm reading line by line. If a line contains special characters, I remove them using regular expressions.

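import re

with open(r"abc.txt", "r+") as fh:
    data = fh.read()
    # print re.sub(r'\W+', '', data)
    # Keep letters, digits, newlines and .;,?!$; replace everything else with a space
    new_str = re.sub(r'[^a-zA-Z0-9\n.;,?!$]', ' ', data)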

So, in my data I'm keeping only alphanumeric characters along with a few special symbols, [.;,?!$]. I also want to keep the euro (€), pound (£), Japanese yen (¥) and rupee (₹) symbols. These are not ASCII characters, so when I include them in my regular expression, as in re.sub('[^a-zA-Z0-9\n.;,?!$€₹¥]', ' ', data), I get an error message: SyntaxError: Non-ASCII character '\xe2' in file preprocess.py on line 23, but no encoding declared

Mridul Sachan
  • 1
    What encoding are you using? Are you using UTF-8? –  Feb 14 '18 at 05:55
  • https://stackoverflow.com/questions/393843/python-and-regular-expression-with-unicode looks like you're not encoding correctly – aydow Feb 14 '18 at 05:56
  • Maybe relevant: https://stackoverflow.com/questions/3170211/why-declare-unicode-by-string-in-python – user2864740 Feb 14 '18 at 05:56
  • 1
    A workaround might be to specify a Unicode range instead ([see here](https://stackoverflow.com/questions/3835917/how-do-i-specify-a-range-of-unicode-characters)). But that might not be clean if the characters you want to spare don't fit neatly into a single range. – Tim Biegeleisen Feb 14 '18 at 05:57
  • 1
    Please specify whether you are using Python 2 or 3. If Python 2, are you using the encoding line on top of the file? – Hubert Grzeskowiak Feb 14 '18 at 05:58
  • I'm not using any encoding; I already shared the code snippet here. @TWrist – Mridul Sachan Feb 14 '18 at 05:59
  • I'm using Python 2.7, and I'm not using any encoding line at the top of the file. @HubertGrzeskowiak – Mridul Sachan Feb 14 '18 at 06:00
  • Just be aware, as was mentioned, that in Python 3 all strings are Unicode and source files are read as UTF-8 by default. If you are working with raw bytes, you need to use the b'' prefix. –  Feb 14 '18 at 06:05

2 Answers

0

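You can make use of Unicode character escapes. For example, the euro character above can be represented as \u20ac. The four hexadecimal digits are the character's Unicode code point, which is the same regardless of the file's encoding. Note that in Python 2 the \u escape is only interpreted inside a unicode literal (u'...'), and the text being searched should be a unicode string as well. In an example regex, this might look like: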

[^a-zA-Z0-9\u20ac]
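
Applied to the full character set from the question, this could look like the following minimal sketch. It assumes the input has already been decoded to a unicode string; the names KEEP_PATTERN and clean are hypothetical and only for illustration. Because the source contains only ASCII, it needs no encoding declaration.

```python
import re

# Hypothetical name. The u'' prefix makes Python 2 interpret the \uXXXX escapes;
# it is also accepted by Python 3.3+.
# \u20ac = euro, \u00a3 = pound, \u00a5 = yen, \u20b9 = rupee
KEEP_PATTERN = re.compile(u'[^a-zA-Z0-9\n.;,?!$\u20ac\u00a3\u00a5\u20b9]')


def clean(text):
    """Replace every character outside the allowed set with a space.

    Assumes `text` is a unicode string (already decoded from bytes).
    """
    return KEEP_PATTERN.sub(u' ', text)
```

For instance, clean(u'a\u20acb@c') keeps the euro sign and turns the @ into a space.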
entpnerd
0

This may not be the whole solution, but it is potentially part of it. Use this as the first two lines of each of your Python 2 files:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

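This tells Python 2 that the source file itself is encoded in UTF-8, so non-ASCII characters such as € are allowed in string literals. In Python 3, UTF-8 is the default source encoding. For the regex to treat each symbol as a single character, the pattern and the data should also be unicode strings (a u'...' literal and decoded input) rather than byte strings.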
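
Putting this together, here is a rough sketch of how the question's code might look with the encoding declaration in place. It assumes the input file is UTF-8 encoded; PATTERN, clean_text and clean_file are hypothetical names, and io.open is used because it decodes the file to unicode on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import re

# Hypothetical name. With the coding line above, the symbols can be typed directly.
# The u'' prefix keeps the pattern a unicode string on Python 2.
PATTERN = re.compile(u'[^a-zA-Z0-9\n.;,?!$€£¥₹]')


def clean_text(text):
    """Replace every character outside the allowed set with a space."""
    return PATTERN.sub(u' ', text)


def clean_file(path):
    """Read a file (assumed to be UTF-8) and return its cleaned contents."""
    with io.open(path, 'r', encoding='utf-8') as fh:
        data = fh.read()  # unicode string, not raw bytes
    return clean_text(data)
```

If the file uses a different encoding, the encoding argument would need to change accordingly.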

Hubert Grzeskowiak