
In the process of scraping some documents using Python 2.7, I've run into some annoying page separators, which I've decided to remove. The separators use some funky characters. I already asked one question here on how to make these characters reveal their UTF-8 codes. There are two troublesome characters used: the byte pair '\xc2\xad' and '\x0c'. Now I just need to remove these characters, as well as some spaces and the page numbers.

Elsewhere on SO, I've seen unicode characters used in tandem with regexes, but they're in a strange format that I do not have these characters in, e.g. '\u00ab'. In addition, none of those examples mix ASCII and non-ASCII characters. Finally, the Python docs are very light on the subject of unicode in regexes... something about flags... I don't know. Can anyone help?

Here is my current usage, which does not do what I want:

re.sub('\\xc2\\xad\s\d+\s\\xc2\\xad\s\\x0c', '', my_str)
Brian Peterson
  • I guess it would be helpful to point you toward [Joel](http://www.joelonsoftware.com/articles/Unicode.html) and [deceze](http://kunststube.net/encoding/) – georg Sep 25 '13 at 09:31
  • Read the Joel before. So should I infer that the difficulty I'm having is just my confusion about what unicode is? – Brian Peterson Sep 25 '13 at 09:33
  • It can be. Could you describe your input more precisely (e.g. what `repr(my_str)` says)? – georg Sep 25 '13 at 09:42
  • ... '\xc2\xa0Nasim\xc2\xa0F.\xc2\xa0Khan,\xc2\xa0Beaumont,\xc2\xa0TX\xc2\xa0\xe2\x80\x93 physician\xc2\xa0and\xc2\xa0surgeon\xc2\xa0license\xc2\xa0(036\xc2\xad065256)\xc2\xa0placed\xc2\xa0in\xc2\xa0refuse\xc2\xa0to\xc2\xa0renew\xc2\xa0status\xc2\xa0after\xc2\xa0being\xc2\xa0disciplined\xc2\xa0in\xc2\xa0the\xc2\xa0state\xc2\xa0of\xc2\xa0Texas.\xc2\xad\xc2\xa09 \xc2\xad\xc2\xa0\x0cSantosh\xc2\xa0P.\xc2\xa0Kumari\xc2\xa0a/k/a\xc2\xa0Chand,\xc2\xa0Fairview\xc2\xa0Heights\xc2\xa0\xe2\x80\x93\xc2\xa0physician\xc2\xa0and\xc2\xa0surgeon\xc2\xa0license\xc2\xa0(036\xc2\xad062976)' ... – Brian Peterson Sep 25 '13 at 09:51
  • It's a 25000 character string, the contents of a text file. I've included above one example of where a page separator used to be in the original PDF. I guess it's... '\xc2\xad\xc2\xa09 \xc2\xad\xc2\xa0\x0c', or about that much that I was trying to remove. You can find it by searching for '\x0c'. – Brian Peterson Sep 25 '13 at 09:52
  • 4
    Ok, it appears to be an utf8-encoded byte string. So your options are either 1) replace verbatim _bytes_ in that string or 2) convert it to unicode and replace _characters_. – georg Sep 25 '13 at 09:58
  • 1
    Look out for zero-width spaces there! – Veedrac Sep 25 '13 at 15:05
  • I think I've converted my string to unicode, via `my_str = my_str.decode('utf-8')`. Is the problem just with my regex? I could match for the exact unicode escape characters, if that's what you mean, @Veedrac. Given that I'm switching to an all-escape regex, though, what should the digit in the middle become? Still '\d'? – Brian Peterson Sep 25 '13 at 19:20
  • I tried your search string and it works for me if I feed it what I think it's looking for. I really do think you just need to tweak the regular expression. Also I'd double up *all* the `\ ` to make sure they're part of the final expression, or use `r''` notation instead. – Mark Ransom Sep 25 '13 at 19:47
  • @thg425 For the record, the following discussion of unicode was much more helpful than the Joel or deceze, though having a bit of background was good: http://nedbatchelder.com/text/unipain.html – Brian Peterson Sep 29 '13 at 00:28
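Putting georg's second option together: decode the UTF-8 byte string to unicode once, then write the pattern against the decoded characters. This is a sketch in Python 3 syntax with a shortened, hypothetical sample string; under 2.7 you would call `my_str.decode('utf-8')` the same way, but put the escapes in a `u''` pattern literal (e.g. `u'\u00ad\u00a0'`), since the 2.x `re` module does not interpret `\u` escapes itself:

```python
import re

# Hypothetical, shortened stand-in for the scraped byte string.
raw = b'status.\xc2\xad\xc2\xa09 \xc2\xad\xc2\xa0\x0cSantosh'
text = raw.decode('utf-8')  # b'\xc2\xad' -> U+00AD, b'\xc2\xa0' -> U+00A0

# soft hyphen, no-break space, page number, space,
# soft hyphen, no-break space, form feed
cleaned = re.sub(r'\u00ad\u00a0\d+ \u00ad\u00a0\x0c', '', text)
```

After decoding, each problem character is a single code point, so the pattern no longer has to spell out multi-byte UTF-8 sequences.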

2 Answers


Rather than seek out specific unwanted chars, you could remove everything not wanted:

re.sub(r'[^\s!-~]', '', my_str)

This throws away all characters not:

  • whitespace (spaces, tabs, newlines, etc)
  • printable "normal" ASCII characters (`!` is the first printable character and `~`, decimal 126, is the last one before DEL)

You could include more chars if needed - just adjust the character class.
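For instance, applied to a decoded snippet (a hypothetical sample; note that on a Python 3 `str`, `\s` also matches Unicode whitespace such as U+00A0, so `re.ASCII` is added here to mirror the Python 2 byte-string behaviour this answer assumes):

```python
import re

# Hypothetical decoded sample containing a page separator.
my_str = 'Texas.\xad\xa09 \xad\xa0\x0cSantosh'

# Keep only ASCII whitespace and printable ASCII; everything else goes.
cleaned = re.sub(r'[^\s!-~]', '', my_str, flags=re.ASCII)
```

The soft hyphens and no-break spaces are stripped; the form feed `'\x0c'` survives, because it counts as ASCII whitespace, so it would need a separate pass if it should go too.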

Bohemian
  • This is smart. The only problem is that the so-called 'soft hyphen', '–', is used over and over again, and is part of my regex for capturing data. At the same time, it is also part of what I was hoping to remove. Sometimes, the OCR technology inserted page breaks that look like, e.g., '– 9 –\x0c'. Usually, the breaks are found in between the data I'm trying to capture. Occasionally, though, it comes right in the middle of a sentence. Thus, I AM only looking for specific instances... – Brian Peterson Sep 25 '13 at 19:05
  • Perhaps, though, I could do an initial sweep through the document, and replace all instances of '–' with '--'. This would also convert the specific instances I'm now trying to remove. I could just drop all instances of '\x0c' as well, and then I have a simple, pure 1-byte regex to deal with, and sidestep the unicode regex. – Brian Peterson Sep 25 '13 at 19:09
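That two-pass sweep could look like this (a sketch on decoded unicode text, with a hypothetical sample; `'\xad'` is the soft hyphen and `'\xa0'` the no-break space):

```python
import re

# Hypothetical decoded sample: a license number containing a soft
# hyphen, followed by a page break of the form described above.
text = 'license\xa0(036\xad065256)\xad\xa09 \xad\xa0\x0cSantosh'

text = text.replace('\xad', '--')  # normalize every soft hyphen first
text = text.replace('\x0c', '')    # then drop every form feed

# The page-break remnant is now easy to target without touching
# the '--' inside license numbers:
text = re.sub(r'--\xa0\d+ --\xa0', '', text)
```

The license number keeps its (normalized) hyphen because it never matches the full page-break pattern.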

I had the same problem. I know this is not an efficient way, but in my case it worked:

 result = re.sub(r"\\", ",x,x", result)      # turn each backslash into a placeholder
 result = re.sub(r",x,xu00ad", "", result)    # drop the literal "\u00ad" sequences
 result = re.sub(r",x,xu", r"\\u", result)    # restore the remaining "\u" escapes
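If the scraped text really does contain the literal six characters `\u00ad` (a leftover backslash escape, which is what the placeholder trick above assumes), a plain string replace avoids the round-trip entirely (hypothetical sample):

```python
# Literal backslash-u-0-0-a-d left in the text by an earlier step.
result = 'Khan,\\u00ad Beaumont'
result = result.replace('\\u00ad', '')
```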
Nozar Safari