Python UTF-8 REGEX

Question

I have a problem finding text with a regex. Everything worked perfectly fine, but when I added "\£" to my regex it started causing problems. I get a SyntaxError: "Non-ASCII character '\xc2' in file (...) but no encoding declared..."

I've tried to solve this problem with using

import sys
reload(sys)  # to enable `setdefaultencoding` again
sys.setdefaultencoding("UTF-8")

but it doesn't help. The re.UNICODE flag doesn't help, and saving the string as unicode (pat) doesn't help either. Is there any way to fix this regex? I just want to build a regular expression and use the pound sign in it. Thanks for any help.

                    k = text.encode('utf-8')
                    pat = u'salar.{1,6}?([0-9\-,\. \tkFFRroOMmTtAanNuUMm\$\&\;\£]{2,})'
                    pattern = re.compile(pat, flags = re.DOTALL|re.I|re.UNICODE)
                    salary =  pattern.search(k).group(1)
                    print (salary)

The error is still there even if I comment out all of those lines (put "#" in front of them). Maybe it's not connected with the re library but with my settings?

What Python version are you using? Check if [this answer](http://stackoverflow.com/questions/33127900/can-the-a-za-z-python-regex-pattern-be-made-to-match-and-replace-non-ascii-uni/33128359#33128359) works for you. Or [this one](http://stackoverflow.com/questions/32863608/regex-python-with-unicode-japanese-character-issue/32868484#32868484). Or [yet another one](http://stackoverflow.com/questions/32575617/using-unicode-hebrew-characters-with-regular-expression/32583571#32583571). — Wiktor Stribiżew, Nov 25 '15 at 10:53
I copied the examples from the first and second answers but it still doesn't work. I am working on Windows, and they put a "# -*- coding: utf-8 -*-" line in their scripts. Could I translate it somehow to Windows? – jawjaw, Nov 25 '15 at 11:00
Why are you encoding to bytes and then searching with a `unicode` pattern? — Ignacio Vazquez-Abrams, Nov 25 '15 at 11:06
Python 3 assumes source code is in UTF-8 by default, and Py3 strings are unicode. The language is a lot cleaner than Py2 in many ways. I strongly suggest you upgrade if you can. — Tom Zych, Nov 25 '15 at 11:09
@jawjaw Yes, by all means, set your source coding as utf-8. Also, if that's the only character, you could change it to `\xa3` in your regex — Mariano, Nov 25 '15 at 11:11

tripleee · Accepted Answer · 2015-11-25T11:59:39.670

6

The error message means Python cannot guess which character set you are using. It also tells you that you can fix it by telling it the encoding of your script.

# coding: utf-8
string = "£"

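or, with a pure-ASCII escape that needs no encoding declaration (in Python 2, note the `u` prefix: this is a unicode string, while the literal above is a byte string)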

string = u"\u00a3"

Without an encoding declaration, Python sees a bunch of bytes which mean different things in different encodings. Rather than guess, it forces you to tell it what they mean. This is codified in PEP 263.

(ASCII is unambiguous [except if your system is EBCDIC I guess] so it knows what you mean if you use a pure-ASCII representation for everything.)
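To illustrate why the bytes are ambiguous, here is a minimal sketch (the variable names are just for illustration). It shows that the pound sign is stored as different bytes in UTF-8 and Latin-1, and that UTF-8 bytes read as Latin-1 turn into two characters. The source itself is pure ASCII, so it should run with or without an encoding declaration.

```python
# Pure-ASCII source: the pound sign is written as an escape, not typed literally.
pound = u"\u00a3"

# The same character becomes different bytes depending on the encoding.
utf8_bytes = pound.encode("utf-8")      # two bytes: 0xC2 0xA3
latin1_bytes = pound.encode("latin-1")  # one byte: 0xA3

# Reading the UTF-8 bytes as if they were Latin-1 yields two characters
# (U+00C2 followed by U+00A3) instead of one.
misread = utf8_bytes.decode("latin-1")

print(repr(utf8_bytes))
print(repr(latin1_bytes))
print(len(misread))
```

The leading 0xC2 byte is the "\xc2" that the SyntaxError in the question complains about, which suggests the script was saved as UTF-8.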

The encoding settings you were fiddling with affect how files and streams are read, and program I/O generally, but not how the program source is interpreted.
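Applied to the regex in the question, a minimal sketch might look like the following. It makes two assumptions that are not in the original code: the sample `text` is made up, and the text is searched as unicode rather than being encoded to bytes first (as one of the comments on the question suggests). The character class is a simplified version of the original, with the pound sign written as the escape `\u00a3`.

```python
# -*- coding: utf-8 -*-
import re

# Unicode pattern; the pound sign is the ASCII-only escape \u00a3.
# The hyphen sits at the end of the class so it is taken literally.
pat = u"salar.{1,6}?([0-9,. \tkfromtanu$&;\u00a3-]{2,})"
pattern = re.compile(pat, flags=re.DOTALL | re.I | re.UNICODE)

# Hypothetical sample input, kept as unicode (not encoded to bytes).
text = u"Salary: \u00a345,000 per annum"

match = pattern.search(text)
# The class also matches spaces, so strip them from the captured value.
salary = match.group(1).strip() if match else None
```

With the `# -*- coding: utf-8 -*-` line in place and the file really saved as UTF-8, typing `£` directly inside the `u"..."` literal should work the same way as the escape.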

edited Nov 25 '15 at 11:59

answered Nov 25 '15 at 11:06

tripleee


Adding " # coding: utf-8" helps. Great quick help. THANKS! – jawjaw Nov 25 '15 at 11:07
Of course, I was just lucky to guess that you are using UTF-8 for your script as well. On Windows, many text editors will still save files in legacy encodings (and braindamagedly call it "ANSI" which is not true or helpful at all). You have to know which encoding the file actually uses in order to get it right. – tripleee Nov 25 '15 at 11:09

... and in fact you should avoid `setdefaultencoding` here: http://stackoverflow.com/questions/28657010/dangers-of-sys-setdefaultencodingutf-8 – tripleee Nov 25 '15 at 11:52
