
In the process of scraping some documents using Python 2.7, I've run into some annoying page separators, which I've decided to remove. The separators use some funky characters. I already asked one question here on how to make these characters reveal their UTF-8 codes. There are two troublesome characters used: the byte pair '\xc2\xad' and '\x0c'. Now I just need to remove these characters, as well as some spaces and the page numbers.

Elsewhere on SO, I've seen unicode characters used in tandem with regexes, but they're in a strange format that I do not have these characters in, e.g. '\u00ab'. In addition, none of those examples mix ASCII and non-ASCII characters. Finally, the Python docs are very light on the subject of unicode in regexes... something about flags... I don't know. Can anyone help?

Here is my current usage, which does not do what I want:

re.sub('\\xc2\\xad\s\d+\s\\xc2\\xad\s\\x0c', '', my_str)
Brian Peterson
  • I guess it would be helpful to point you toward [Joel](http://www.joelonsoftware.com/articles/Unicode.html) and [deceze](http://kunststube.net/encoding/) – georg Sep 25 '13 at 09:31
  • Read the Joel before. So should I infer that the difficulty I'm having is just my confusion about what unicode is? – Brian Peterson Sep 25 '13 at 09:33
  • It can be. Could you describe your input more precisely (e.g. what `repr(my_str)` says)? – georg Sep 25 '13 at 09:42
  • ... '\xc2\xa0Nasim\xc2\xa0F.\xc2\xa0Khan,\xc2\xa0Beaumont,\xc2\xa0TX\xc2\xa0\xe2\x80\x93 physician\xc2\xa0and\xc2\xa0surgeon\xc2\xa0license\xc2\xa0(036\xc2\xad065256)\xc2\xa0placed\xc2\xa0in\xc2\xa0refuse\xc2\xa0to\xc2\xa0renew\xc2\xa0status\xc2\xa0after\xc2\xa0being\xc2\xa0disciplined\xc2\xa0in\xc2\xa0the\xc2\xa0state\xc2\xa0of\xc2\xa0Texas.\xc2\xad\xc2\xa09 \xc2\xad\xc2\xa0\x0cSantosh\xc2\xa0P.\xc2\xa0Kumari\xc2\xa0a/k/a\xc2\xa0Chand,\xc2\xa0Fairview\xc2\xa0Heights\xc2\xa0\xe2\x80\x93\xc2\xa0physician\xc2\xa0and\xc2\xa0surgeon\xc2\xa0license\xc2\xa0(036\xc2\xad062976)' ... – Brian Peterson Sep 25 '13 at 09:51
  • It's a 25000 character string, the contents of a text file. I've included above one example of where a page separator used to be in the original PDF. I guess it's... '\xc2\xad\xc2\xa09 \xc2\xad\xc2\xa0\x0c', or about that much that I was trying to remove. You can find it by searching for '\x0c'. – Brian Peterson Sep 25 '13 at 09:52
  • 4
    Ok, it appears to be an utf8-encoded byte string. So your options are either 1) replace verbatim _bytes_ in that string or 2) convert it to unicode and replace _characters_. – georg Sep 25 '13 at 09:58
  • 1
    Look out for zero-width spaces there! – Veedrac Sep 25 '13 at 15:05
  • I think I've converted my string to unicode, via `my_str = my_str.decode('utf-8')`. Is the problem just with my regex? I could match for the exact unicode escape characters, if that's what you mean, @Veedrac. Given that I'm switching to an all-escape regex, though, what should the digit in the middle become? Still '\d'? – Brian Peterson Sep 25 '13 at 19:20
  • I tried your search string and it works for me if I feed it what I think it's looking for. I really do think you just need to tweak the regular expression. Also I'd double up *all* the `\ ` to make sure they're part of the final expression, or use `r''` notation instead. – Mark Ransom Sep 25 '13 at 19:47
  • @thg425 For the record, the following discussion of unicode was much more helpful than the Joel or deceze, though having a bit of background was good: http://nedbatchelder.com/text/unipain.html – Brian Peterson Sep 29 '13 at 00:28
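Putting georg's second option together: decode the UTF-8 byte string to unicode once, then write the pattern against the decoded characters. This is a sketch in Python 3 syntax with a shortened, hypothetical sample string; under 2.7 you would call `my_str.decode('utf-8')` the same way, but put the escapes in a `u''` pattern literal (e.g. `u'\u00ad\u00a0'`), since the 2.x `re` module does not interpret `\u` escapes itself:

```python
import re

# Hypothetical, shortened stand-in for the scraped byte string.
raw = b'status.\xc2\xad\xc2\xa09 \xc2\xad\xc2\xa0\x0cSantosh'
text = raw.decode('utf-8')  # b'\xc2\xad' -> U+00AD, b'\xc2\xa0' -> U+00A0

# soft hyphen, no-break space, page number, space,
# soft hyphen, no-break space, form feed
cleaned = re.sub(r'\u00ad\u00a0\d+ \u00ad\u00a0\x0c', '', text)
```

After decoding, each problem character is a single code point, so the pattern no longer has to spell out multi-byte UTF-8 sequences.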

2 Answers


Rather than seek out specific unwanted chars, you could remove everything not wanted:

re.sub(r'[^\s!-~]', '', my_str)

This throws away all characters not:

  • whitespace (spaces, tabs, newlines, etc)
  • printable "normal" ASCII characters (`!` is the first printable character and `~`, decimal 126, is the last one before DEL)

You could include more chars if needed - just adjust the character class.
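For instance, applied to a decoded snippet (a hypothetical sample; note that on a Python 3 `str`, `\s` also matches Unicode whitespace such as U+00A0, so `re.ASCII` is added here to mirror the Python 2 byte-string behaviour this answer assumes):

```python
import re

# Hypothetical decoded sample containing a page separator.
my_str = 'Texas.\xad\xa09 \xad\xa0\x0cSantosh'

# Keep only ASCII whitespace and printable ASCII; everything else goes.
cleaned = re.sub(r'[^\s!-~]', '', my_str, flags=re.ASCII)
```

The soft hyphens and no-break spaces are stripped; the form feed `'\x0c'` survives, because it counts as ASCII whitespace, so it would need a separate pass if it should go too.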

Bohemian
  • This is smart. The only problem is that the so-called 'soft hyphen', '–', is used over and over again, and is part of my regex for capturing data. At the same time, it is also part of what I was hoping to remove. Sometimes, the OCR technology inserted page breaks that look like, e.g., '– 9 –\x0c'. Usually, the breaks are found in between the data I'm trying to capture. Occasionally, though, it comes right in the middle of a sentence. Thus, I AM only looking for specific instances... – Brian Peterson Sep 25 '13 at 19:05
  • Perhaps, though, I could do an initial sweep through the document, and replace all instances of '–' with '--'. This would also convert the specific instances I'm now trying to remove. I could just drop all instances of '\x0c' as well, and then I have a simple, pure 1-byte regex to deal with, and sidestep the unicode regex. – Brian Peterson Sep 25 '13 at 19:09
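That two-pass sweep could look like this (a sketch on decoded unicode text, with a hypothetical sample; `'\xad'` is the soft hyphen and `'\xa0'` the no-break space):

```python
import re

# Hypothetical decoded sample: a license number containing a soft
# hyphen, followed by a page break of the form described above.
text = 'license\xa0(036\xad065256)\xad\xa09 \xad\xa0\x0cSantosh'

text = text.replace('\xad', '--')  # normalize every soft hyphen first
text = text.replace('\x0c', '')    # then drop every form feed

# The page-break remnant is now easy to target without touching
# the '--' inside license numbers:
text = re.sub(r'--\xa0\d+ --\xa0', '', text)
```

The license number keeps its (normalized) hyphen because it never matches the full page-break pattern.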

I had the same problem. I know this is not an efficient way, but in my case it worked:

 result = re.sub(r"\\", ",x,x", result)      # turn each backslash into a placeholder
 result = re.sub(r",x,xu00ad", "", result)    # drop the literal "\u00ad" sequences
 result = re.sub(r",x,xu", r"\\u", result)    # restore the remaining "\u" escapes
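If the scraped text really does contain the literal six characters `\u00ad` (a leftover backslash escape, which is what the placeholder trick above assumes), a plain string replace avoids the round-trip entirely (hypothetical sample):

```python
# Literal backslash-u-0-0-a-d left in the text by an earlier step.
result = 'Khan,\\u00ad Beaumont'
result = result.replace('\\u00ad', '')
```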
Nozar Safari