I am trying to scrape a site using this code:

    #!/usr/bin/python
    # coding=utf-8
    import urllib2

    req = urllib2.Request('http://some website')
    # add_header() takes the header name and its value as two separate arguments
    req.add_header('User-agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36')
    f = urllib2.urlopen(req)
    body = f.read()
    f.close()

This is part of the document returned by the read() method:

    T\u00f3m l\u01b0\u1ee3c di\u1ec5n ti\u1ebfn Th\u01b0\u1ee3ng H\u1ed9i \u0110\u1ed3ng Gi\u00e1m M\u1ee5c v\u1ec1 Gia \u0110\u00ecnh\

How can I change the above code to get the result like this?

    Tóm lược diễn tiến Thượng Hội Đồng Giám Mục về Gia Đình

Thank you.

My issue was solved by following mata's advice. Here is the code that works for me. Thank you everyone for helping, especially mata.

    #!/usr/bin/python
    # coding=utf-8
    import urllib2

    req = urllib2.Request('http://some website')
    req.add_header('User-agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36')
    f = urllib2.urlopen(req)
    # The body contains literal \uXXXX escapes: decode them, then re-encode as UTF-8
    body = f.read().decode('unicode-escape').encode('utf-8')
    f.close()

H123
  • Have you checked similar questions to see if they help? like [this one](http://stackoverflow.com/questions/701704/convert-html-entities-to-unicode-and-vice-versa) – user2464424 Feb 16 '16 at 12:55
  • Python 2 wouldn't produce unicode escapes in a string returned by read() if they weren't there literally, so could it be that the document contains unicode escapes (might be JSON)? In that case the `json` module might be helpful, or try `body.decode('unicode-escape')` – mata Feb 16 '16 at 13:03
  • beautifulsoup is good for this sort of thing – Vorsprung Feb 16 '16 at 13:09
  • Thanks everyone for helping. My issue is solved by using mata's advice. – H123 Feb 16 '16 at 20:34
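
As a rough illustration of mata's comment, the sketch below decodes a body containing literal `\uXXXX` escapes in two ways. The sample bytes are made up for the example (the real response is not shown in the question), and the code is written to run on Python 3 as well as Python 2. If the response really is JSON, `json.loads` is probably the safer choice, because `unicode-escape` reads any non-ASCII bytes as Latin-1 and would mangle text that is already UTF-8.

```python
import json

# Hypothetical sample: a JSON body with literal \uXXXX escapes (not the real response).
raw_json = b'{"title": "T\\u00f3m l\\u01b0\\u1ee3c"}'

# Option 1: let the JSON parser resolve the escapes.
title_from_json = json.loads(raw_json.decode("utf-8"))["title"]

# Option 2: decode the escapes directly, as in body.decode('unicode-escape').
raw_fragment = b"T\\u00f3m l\\u01b0\\u1ee3c"
title_from_escape = raw_fragment.decode("unicode-escape")

# Pitfall: bytes that are already UTF-8 are read as Latin-1 by unicode-escape,
# so the single character U+00F3 comes back as two wrong characters.
utf8_bytes = u"\u00f3".encode("utf-8")
mangled = utf8_bytes.decode("unicode-escape")
```

Both options give the same unicode string for pure-ASCII input with escapes; they only differ once the body also contains raw non-ASCII bytes.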

2 Answers

You need to detect the encoding of the page and then decode it. Try the chardet library for encoding detection: http://github.com/chardet/chardet. See the usage help and examples at http://chardet.readthedocs.org/en/latest/usage.html

    pip install chardet

Then use it:

    import urllib2
    import chardet  # <- import this lib

    req = urllib2.Request('http://some website')
    req.add_header('User-agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36')
    f = urllib2.urlopen(req)
    body = f.read()
    f.close()

    code = chardet.detect(body)  # <- detect the encoding (a guess; 'encoding' may be None)
    # Assumption: fall back to UTF-8 when chardet cannot make a guess
    body = body.decode(code['encoding'] or 'utf-8')  # <- decode

efirvida

You must detect the page's encoding. In most cases this information comes in the Content-Type header of the response.

    #!/usr/bin/python
    # coding=utf-8

    import urllib2

    req = urllib2.Request("http://some website")
    req.add_header("User-agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36")
    f = urllib2.urlopen(req)
    # Read the charset parameter of the Content-Type header (None if the server sends none)
    encoding = f.headers.getparam('charset') or 'utf-8'  # assumption: fall back to UTF-8
    body = f.read().decode(encoding)  # decode the raw bytes using that encoding
    f.close()

There are other ways to get the same result, but I adapted this one to your approach.
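
To make the header-based approach a bit more robust, here is a small sketch that pulls the charset out of a `Content-Type` value and falls back to a default when the server does not declare one. The helper name `charset_from_content_type`, the sample header values, and the UTF-8 default are my own assumptions, not part of urllib2. It only uses the standard `email.message.Message` class, so it should run on both Python 2 and Python 3.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type value, or `default` if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset parameter,
    # or the fallback object when the parameter is missing.
    return msg.get_content_charset(default)


# Hypothetical header values and body bytes, for illustration only.
declared = charset_from_content_type("text/html; charset=UTF-8")
missing = charset_from_content_type("text/html")

body = u"Gia \u0110\u00ecnh".encode("utf-8")  # stands in for f.read()
text = body.decode(declared)
```

With the code above you would pass something like `f.headers.get('Content-Type', '')` as the argument. Keep in mind that the declared charset can be wrong, which is where a detector such as chardet may still help.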

Mauro Baraldi