68

I'm trying to open a webpage using urllib.request.urlopen() then search it with regular expressions, but that gives the following error:

TypeError: can't use a string pattern on a bytes-like object

I understand why: urllib.request.urlopen() returns a byte stream, so re doesn't know which encoding to use. What am I supposed to do in this situation? Is there a way to specify the encoding in the request, or will I need to decode the bytes myself? If so, how should I do it? I assume I should read the encoding from the response headers, or from the encoding declared in the HTML if there is one, and then decode with that.

smci
kryptobs2000
  • 1
    not one of these answers work for me in Python 3.5x using urllib.request because urllib.request.urlopen(url) literally returns ONLY a byte stream - it has NO member functions to parse any form of header in the html. So no info(), no headers, etc. I'd have to parse it myself to find the encoding, but without the encoding I can't convert it to text to parse it. It's a catch 22. – user2465201 Dec 19 '16 at 22:02
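
The error in the question can be reproduced without any network access, which also shows the two ways around it: decode the bytes to str first, or search the bytes with a bytes pattern. This is a minimal sketch; the sample `body` value is made up for illustration and stands in for whatever `urlopen(...).read()` returns.

```python
import re

# Stand-in for urllib.request.urlopen(url).read(); hypothetical sample data.
body = b"<html><title>Example page</title></html>"

# A str pattern applied to bytes raises the TypeError from the question.
try:
    re.search(r"<title>(.*?)</title>", body)
    raised = False
except TypeError:
    raised = True

# Option 1: decode first, then use a str pattern.
text = body.decode("utf-8")
title_from_text = re.search(r"<title>(.*?)</title>", text).group(1)

# Option 2: keep the bytes and use a bytes pattern (note the rb prefix).
title_from_bytes = re.search(rb"<title>(.*?)</title>", body).group(1)
```

Option 2 avoids the encoding question entirely, but the matches come back as bytes, so it mostly suits ASCII-only patterns.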

7 Answers

113

For me, the solution is as follows (Python 3):

import urllib.request

resource = urllib.request.urlopen(an_url)
content = resource.read().decode(resource.headers.get_content_charset())
Ivan Klass
  • 8
    Looks like the best answer but what if the server doesn't send the charset info? – rvighne Jul 16 '14 at 18:05
  • If the server doesn't send charset info your best bet at that point is to guess. – Iguananaut Aug 06 '14 at 16:30
  • 11
    @rvighne: if the server doesn't pass `charset` in the `Content-Type` header, then [there are complex rules to figure out the character encoding](https://blog.whatwg.org/the-road-to-html-5-character-encoding), e.g., it may be specified inside the HTML document: `<meta charset="utf-8">`. – jfs Oct 22 '14 at 04:38
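
The fallback discussed in these comments could be sketched roughly as follows: try the charset from the header, then a charset declared in the document itself, then a default. The function name `decode_body`, the 1024-byte scan window, and the `utf-8` default are all assumptions made for illustration. This is a simplified guess, not the full set of rules described in the linked article.

```python
import re

def decode_body(raw, header_charset=None, default="utf-8"):
    """Decode raw bytes using the header charset, else a <meta> hint, else a default."""
    candidates = []
    if header_charset:
        candidates.append(header_charset)
    # Look for a charset declaration near the start of the document.
    match = re.search(rb"<meta[^>]+charset=[\"']?([\w-]+)", raw[:1024], re.IGNORECASE)
    if match:
        candidates.append(match.group(1).decode("ascii"))
    candidates.append(default)
    for name in candidates:
        try:
            return raw.decode(name)
        except (LookupError, UnicodeDecodeError):
            continue  # unknown codec name or wrong guess; try the next one
    # Last resort: do not fail, but mark bytes that could not be decoded.
    return raw.decode(default, errors="replace")
```

With the names from the answer above, it would be called as `decode_body(resource.read(), resource.headers.get_content_charset())`.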
66

You just need to decode the response, using the charset given in the Content-Type header (typically its last value). There is an example in the tutorial too.

# assumes response already holds the bytes from urlopen(...).read()
output = response.decode('utf-8')
Senthil Kumaran
  • 24
    What if the charset is not utf-8? Would it be a better idea to somehow determine it from the response instead of hard-coding this assumption? – Elias Zamaria Jun 23 '14 at 17:56
  • The `Content-Type` header on the response includes the `charset` value, which is what you need to properly decode the response (at least, before [guessing](https://blog.whatwg.org/the-road-to-html-5-character-encoding) `utf-8`). For example: `Content-Type: text/html; charset=utf-8` – Dolph Sep 19 '18 at 21:04
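
To see how the `charset` value is pulled out of such a header, here is a small offline sketch. It builds an `email.message.Message` by hand, which is the base class of the headers object that `urlopen()` responses expose, so the same methods apply. The header values are examples only.

```python
from email.message import Message

# Build a header container like the one exposed as response.headers.
headers = Message()
headers["Content-Type"] = "text/html; charset=utf-8"

content_type = headers.get_content_type()   # media type without parameters
charset = headers.get_content_charset()     # value of the charset parameter

# Without a charset parameter the method returns None,
# or the fallback value passed as its argument.
bare = Message()
bare["Content-Type"] = "text/html"
missing = bare.get_content_charset()
fallback = bare.get_content_charset("utf-8")
```

Passing a fallback, as in the last line, is one way to avoid calling `decode(None)` when the server sends no charset.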
10

I had the same issues for the last two days. I finally have a solution. I'm using the info() method of the object returned by urlopen():

import urllib.request

req = urllib.request.urlopen(URL)
charset = req.info().get_content_charset()
content = req.read().decode(charset)
Glenn
pytohs
  • 5
    this is exactly the same answer that Ivan Klass posted 2 years before, except using `info` instead of `headers`. :-/ With no explanation as to why pick this instead of that, this answer looks like a duplicate to me. – msb Dec 29 '18 at 01:18
6

With requests:

import requests

response = requests.get(URL).text
xged
5

Here is a simple example HTTP request (which I tested, and it works):

import urllib.request

address = "http://stackoverflow.com"
html = urllib.request.urlopen(address).read().decode('utf-8')

Make sure to read the documentation.

https://docs.python.org/3/library/urllib.request.html

If you want a more detailed GET/POST request:

import urllib.request
# HTTP REQUEST of some address
def REQUEST(address):
    req = urllib.request.Request(address)
    req.add_header('User-Agent', 'NAME (Linux/MacOS; FROM, USA)')
    response = urllib.request.urlopen(req)
    html = response.read().decode('utf-8')  # decode bytes to text; assumes the page is UTF-8
    print("REQUEST (ONLINE): " + address)
    return html
Asher
  • 1
    Does this not have the same issue as the accepted answer? To quote a comment from there: _What if the charset is not utf-8? Would it be a better idea to somehow determine it from the response instead of hard-coding this assumption?_ – AMC Jul 22 '20 at 21:12
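
A short offline sketch of the concern raised here: hard-coding `utf-8` works only while the bytes really are UTF-8. The sample text is hypothetical and is encoded as Latin-1 to stand in for a page served in another charset.

```python
# Hypothetical sample: the same text encoded two ways.
text = "caf\u00e9"
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

ok = utf8_bytes.decode("utf-8")  # round-trips cleanly

# Decoding Latin-1 bytes as UTF-8 raises instead of returning text.
try:
    latin1_bytes.decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True

# errors="replace" avoids the exception but loses the original character.
lossy = latin1_bytes.decode("utf-8", errors="replace")
```

So a hard-coded charset either fails outright or silently damages the text, which is why reading it from the response is the safer default.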
1
urllib.request.urlopen(url).headers.get('Content-Type')

Will output something like this:

text/html; charset=utf-8

Brian Deragon
wynemo
-3

After you make a request with req = urllib.request.urlopen(...), you have to read the response by calling html_string = req.read(). That gives you the response body (as bytes in Python 3, so it still needs decoding), which you can then parse the way you want.

Jesse Cohen