I'm new to Python and am trying to use urllib2 and lxml to fetch and parse a page. Everything seems to work, except that the parsed page has strange characters embedded in it when I open it in my browser. I'm guessing this is a Unicode/lxml parsing problem. When I get the text content of an element with .text_content() and print it, I get output like "sometext \342\200\223 moretext", whereas the original page shows "sometext - moretext".
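To narrow it down, here is a minimal sketch of what those escapes might be. It assumes the \342\200\223 sequence is three raw bytes written in octal and that the page is UTF-8 encoded; I haven't confirmed either for my actual page. The names raw, text and dash are just for illustration.

```python
import unicodedata

# The octal escapes from the printed output, written out as raw bytes.
raw = b"sometext \342\200\223 moretext"

# Assumption: the page is UTF-8, so decode the bytes as UTF-8.
text = raw.decode("utf-8")

# "sometext " is 9 characters long, so index 9 is whatever
# character the three escaped bytes decode to.
dash = text[9]
print("%r %s" % (dash, unicodedata.name(dash)))
```

If that assumption holds, the three bytes are a single dash-like character rather than garbage, which would suggest the text is being handled as bytes (or decoded with the wrong encoding) somewhere between the fetch and the output.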
Could anyone tell me:
1. What's going on?
2. How can I fix it?
3. Where can I read up on encoding issues like these?
Thanks!