I have some simple Python code that makes a request to a server:

import urllib2

html_page = urllib2.urlopen(baseurl, timeout=20)
print html_page.read()
html_page.close()

When I try to scrape a page that contains a '-' (dash) character, the dash shows correctly in the browser, but when I print the response from urlopen it comes out as '?'. I tried to reproduce this with a local HTML file, copying the affected text over from the page source, but could not reproduce it.

What other factors/variables might be in play? Could this have something to do with encoding?

UPDATE: I now know this problem is about encoding. The website is encoded in 'iso-8859-1'. The problem is that I still cannot decode it, even after following "Python: Converting from ISO-8859-1/latin1 to UTF-8".

The character, when decoded, gives me:

>>> text.decode("iso-8859-1")
  u"</strong><p>Let's\x97in "
>>> text.decode("iso-8859-1").encode("utf8")
  "</strong><p>Let's\xc2\x97in "
>>> print text.decode("iso-8859-1").encode("utf8")
  </strong><p>Let'sin

The character just completely disappears. Anyone have any ideas?

Ying
  • Might it be [an emdash or an endash](http://en.wikipedia.org/wiki/Dash#Common_dashes)? – cdhowie Jul 26 '12 at 17:39
  • Checked: it is the encoding, but I still have a problem. See my update, and thanks for the help! – Ying Jul 26 '12 at 21:13
  • That's an "em dash", which is Unicode code point U+2014 and is encoded in Windows-1252 as 0x97 (but it is *not* part of ISO 8859-1). – Adam Rosenfield Jul 26 '12 at 21:20
  • @Adam Rosenfield, wow, that really helped! The site identified the charset as iso-8859-1, so that's really misleading! – Ying Jul 26 '12 at 22:40

1 Answer

So thanks to Adam Rosenfield, I figured out my problem. The website indicated that the charset was iso-8859-1:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

But the character I had an issue with was an em dash, which is byte 0x97 in Windows-1252 and is not part of ISO-8859-1:

>>> text.decode("windows-1252")
  u"</strong><p>Let's\u2014in "
>>> print text.decode("windows-1252")
  </strong><p>Let's—in
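
For anyone reproducing this on Python 3, here is a minimal sketch of the same comparison. The `raw` bytes are a stand-in for the affected fragment of the response, not the actual page, and the variable names are only illustrative.

```python
# Python 3 sketch. `raw` is an assumed stand-in for the bytes the server returned.
raw = b"</strong><p>Let's\x97in "

# ISO-8859-1 maps byte 0x97 to U+0097, a C1 control character that
# most terminals do not draw, so the dash appears to vanish.
as_latin1 = raw.decode("iso-8859-1")

# Windows-1252 maps the same byte to U+2014, the em dash.
as_cp1252 = raw.decode("windows-1252")

print(repr(as_latin1))
print(repr(as_cp1252))

# Re-encoding the correctly decoded text as UTF-8 keeps the dash.
utf8_bytes = as_cp1252.encode("utf-8")
print(utf8_bytes)
```

The two codecs agree on every byte except the 0x80 to 0x9F range, which is why the rest of the page looked fine under either decoding.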

Thanks guys!

Ying
  • Yep. [Quoth Wikipedia](http://en.wikipedia.org/wiki/Windows-1252): "It is very common to mislabel Windows-1252 text with the charset label ISO-8859-1. A common result was that all the quotes and apostrophes (produced by "smart quotes" in Microsoft software) were replaced with question marks or boxes on non-Windows operating systems, making text difficult to read. Most modern web browsers and e-mail clients treat the MIME charset ISO-8859-1 as Windows-1252 in order to accommodate such mislabeling." – Adam Rosenfield Jul 29 '12 at 16:40
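
The browser behavior quoted above can be imitated when scraping. The following is a rough Python 3 sketch, not a library API: `decode_like_browser` and `LATIN1_LABELS` are hypothetical names, and the set of labels is an assumption that only covers a few common spellings.

```python
# Hypothetical helper: treat a declared ISO-8859-1 charset as Windows-1252,
# the way the quoted passage says most browsers do.
LATIN1_LABELS = {"iso-8859-1", "iso8859-1", "latin-1", "latin1"}


def decode_like_browser(raw, declared_charset):
    """Decode `raw` bytes using the declared charset, remapping Latin-1 labels."""
    label = (declared_charset or "").strip().lower()
    if label in LATIN1_LABELS:
        codec = "windows-1252"
    else:
        # Fall back to UTF-8 when no charset was declared (an assumption).
        codec = label or "utf-8"
    # errors="replace" avoids an exception on the few bytes that
    # Windows-1252 leaves undefined (for example 0x81).
    return raw.decode(codec, errors="replace")


text = decode_like_browser(b"Let's\x97in", "ISO-8859-1")
print(text)
```

An unrecognized charset label would still raise `LookupError` here, so real scraping code would probably want to catch that and fall back to a default.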