I have the following description that I want to scrape using my program.

<hr>Provides AFROTC cadets up to 13 options for practical leadership and specialized training through exposure to USAF functions, deployments, and employment operations. Foreign language and cultural immersions also available/possible but overall emphasis remains on leadership development and practicum. All programs conducted off-site at selected Air Forces bases and other locations in the USA and abroad.<br>

I have the following code:

findDescription = re.findall('<hr>(.*?)(?:<strong>|<br>)', coursePage)

And I get the following output:

['Provides AFROTC cadets up to 13 options for practical leadership and specialized training through exposure to USAF functions, deployments, and employment operations.\xc2\xa0 Foreign language and cultural immersions also available/possible but overall emphasis remains on leadership development and practicum.\xc2\xa0 All programs conducted off-site at selected Air Forces bases and other locations in the USA and abroad.']

Why am I getting weird stuff like \xc2\xa0 in here? My code also gets tripped up by the quotation mark ". I believed that the period . in my regex would match any character. What is going wrong?

I appreciate any quick hints. I only heard about regex on Friday and I have made tremendous progress, but this one has really tripped me up for a few hours.

Warm Regards, GeekyOmega

GeekyOmega
  • 3
    Parsing HTML with regular expressions is fragile. It's far more robust to use an established library like BeautifulSoup. http://www.crummy.com/software/BeautifulSoup/ – Andy Lester Feb 03 '13 at 21:20
  • 1
    Can you provide an example of your code getting 'tripped up' by double quotes? – Mark Amery Feb 03 '13 at 21:31
  • I think it is both quotations and ' symbol. Here is the ' symbol example: accountant’s – GeekyOmega Feb 03 '13 at 21:37
  • 2
    @GeekyOmega Google "HTML entity encoding". Your example is an HTML-entity-encoded string. You need to decode it somehow (BeautifulSoup, already mentioned by AndyLester, can do this, but there are other ways too) in order to get the text as it would be displayed in a browser. Observe that if you paste your string into the box here: http://htmlentities.net/ and click 'decode', you'll see the output you expect and want. – Mark Amery Feb 03 '13 at 21:49
  • +1 to @MarkAmery. If you want to decode these in Python, there are multiple questions on SO that explain how to do this. Most of them seem to be on the Related list for [this one](http://stackoverflow.com/questions/57708/convert-xml-html-entities-into-unicode-string-in-python). – abarnert Feb 03 '13 at 22:16
  • This Q/A is a good example of how regular expressions parsing html is fragile and troublesome and generally not a good idea. Here is the fun answer: http://stackoverflow.com/a/1732454/564406. DOM parsing libraries, like those mentioned in other comments, are the best choice. – David Feb 05 '13 at 14:43
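
The entity decoding suggested in the comments above can be sketched with Python's standard library. This is only an illustration in Python 3 syntax (`html.unescape` exists from Python 3.4 on; the original post appears to use Python 2). The input string is hypothetical, since the exact entity in the asker's page is not shown.

```python
import html

# Hypothetical input: a numeric reference for a right single quote,
# a named non-breaking space, and named double quotes.
encoded = "accountant&#8217;s report&nbsp;&quot;draft&quot;"

# Convert named and numeric character references to the characters they stand for.
decoded = html.unescape(encoded)

print(decoded)
```

A full HTML parser such as BeautifulSoup, as recommended above, performs this decoding as part of extracting text, so a separate step like this is mainly useful when the text has already been pulled out with a regex.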

1 Answer

5

\xC2\xA0 is the UTF-8 encoding of the Unicode character U+00A0 (the non-breaking space), which is usually written as &nbsp; in HTML files.
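
One way to check this claim is to decode the two bytes and ask Python which character they form. The snippet is a small sketch in Python 3 syntax (the `b''` prefix is an adaptation, since the question's output looks like a Python 2 byte string), and the sample text is shortened from the question.

```python
import unicodedata

# Bytes as they appear in the question's output (shortened).
raw = b"operations.\xc2\xa0 Foreign language"

# Interpret the bytes as UTF-8: the two bytes C2 A0 become one character.
text = raw.decode("utf-8")

# The character immediately after the period.
nbsp = text[len("operations.")]

print(hex(ord(nbsp)))          # code point of that character
print(unicodedata.name(nbsp))  # its official Unicode name
```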

BeniBela
  • 1
    To add to this: the characters are not being introduced by calling `re.findall`. They are there in coursePage beforehand. – Mark Amery Feb 03 '13 at 21:29
  • 1
    It's also worth adding that the character in question is a 'non-breaking space': http://en.wikipedia.org/wiki/Non-breaking_space If you use `print` in Python to display a string with a non-breaking space in it, it will be displayed, reasonably enough, as a space. `print`ing a list, however, calls `repr` on each element of the list (not `print`), and calling `repr` on a utf-8 byte string containing a non-breaking space will display it in the way shown in the OP's post. This may explain his confusion and why he thinks `findall` introduced the 'weird stuff'. – Mark Amery Feb 03 '13 at 21:37
  • Interesting. I was putting it in the list, and then calling the element in the list. Ultimately, is there any way for me to store this in a .txt or .csv file and not have this issue? – GeekyOmega Feb 03 '13 at 21:41
  • 1
    @GeekyOmega When you write to a file, it won't write a backslash, x, C and 2; it will write the bytes that `'\xC2'` and `'\xA0'` correspond to. Just `.write` the string as it is and you'll be fine, as long as what you want in the file is a valid UTF-8-encoded string containing a non-breaking space. If that isn't what you want, you probably need to do some kind of find and replace. If you're *not sure* whether it's what you want, then I suspect you need to read up a bit on string encodings. – Mark Amery Feb 03 '13 at 21:52
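
The points made in the comments above (list display versus `print`, find and replace, and writing to a file) can be tried out with a short sketch. It uses Python 3 syntax, where text is decoded to `str` first and the file is opened with an explicit encoding; the original discussion concerns Python 2 byte strings, so treat this as an adaptation. The sample text and the temporary file path are assumptions made for illustration.

```python
import os
import tempfile

# Sample text shortened from the question's output, decoded to a str.
description = b"operations.\xc2\xa0 Foreign language".decode("utf-8")

print([description])  # a list displays the repr of each element, so the escape is visible
print(description)    # printing the string itself emits the actual non-breaking space

# Optional find and replace, if ordinary spaces are preferred.
cleaned = description.replace("\u00a0", " ")

# Write the text to a throwaway file as UTF-8.
path = os.path.join(tempfile.mkdtemp(), "description.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(description)

# Read the raw bytes back to see what was actually stored.
with open(path, "rb") as f:
    on_disk = f.read()
```

Here `on_disk` holds the two bytes C2 A0 rather than the literal characters of the escape sequence, while `cleaned` contains only ordinary spaces.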