How to handle response encoding from urllib.request.urlopen(), to avoid TypeError: can't use a string pattern on a bytes-like object

Question

I'm trying to open a webpage using urllib.request.urlopen() then search it with regular expressions, but that gives the following error:

TypeError: can't use a string pattern on a bytes-like object

I understand why: urllib.request.urlopen() returns a byte stream, so re doesn't know which encoding to use. What am I supposed to do in this situation? Is there a way to specify the encoding in the request, or will I need to decode the bytes myself? If so, what should I be doing? I assume I should read the encoding from the header info, or from the encoding declared in the HTML if there is one, and then decode using that?
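
To make the error concrete, here is a minimal sketch that reproduces it without any network access and shows two ways around it. The `raw` bytes literal is only a stand-in for whatever `urlopen(...).read()` returns, and UTF-8 is assumed rather than detected.

```python
import re

# Stand-in for the bytes returned by urlopen(...).read(); no network needed.
raw = b"<html><title>Example page</title></html>"

# A str pattern cannot be applied to a bytes object:
try:
    re.search(r"<title>(.*?)</title>", raw)
except TypeError as exc:
    error_message = str(exc)

# Fix 1: decode the bytes to str first (UTF-8 is an assumption here).
text = raw.decode("utf-8")
title_from_text = re.search(r"<title>(.*?)</title>", text).group(1)

# Fix 2: keep the data as bytes and use a bytes pattern instead.
title_from_bytes = re.search(rb"<title>(.*?)</title>", raw).group(1)

print(error_message)
print(title_from_text, title_from_bytes)
```

Fix 2 avoids the encoding question entirely, but the match results are bytes too, so it mostly suits ASCII-only patterns.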

Not one of these answers works for me in Python 3.5x using urllib.request, because urllib.request.urlopen(url) literally returns ONLY a byte stream; it has NO member functions to parse any form of header in the HTML. So no info(), no headers, etc. I'd have to parse it myself to find the encoding, but without the encoding I can't convert it to text to parse it. It's a catch-22. – user2465201 Dec 19 '16 at 22:02

score 113 · Answer 1 · answered Oct 03 '13 at 09:54

For me, the solution is the following (Python 3):

import urllib.request

resource = urllib.request.urlopen(an_url)
content = resource.read().decode(resource.headers.get_content_charset())

Ivan Klass

Looks like the best answer but what if the server doesn't send the charset info? – rvighne Jul 16 '14 at 18:05
If the server doesn't send charset info your best bet at that point is to guess. – Iguananaut Aug 06 '14 at 16:30
@rvighne: if the server doesn't pass `charset` in the `Content-Type` header then [there are complex rules to figure out the character encoding](https://blog.whatwg.org/the-road-to-html-5-character-encoding), e.g., it may be specified inside the HTML document: `<meta charset="utf-8">`. – jfs Oct 22 '14 at 04:38
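
As a rough illustration of the fallback discussed in these comments, the sketch below looks for a charset declared in a `<meta>` tag when the header gives none. The names `sniff_meta_charset` and `decode_html`, the regular expression, and the 1024-byte scan window are all assumptions made for this example. It is a heavy simplification of the real encoding-detection rules, not an implementation of them.

```python
import re

# Simplified pattern: matches <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
_META_CHARSET = re.compile(
    rb"<meta[^>]+charset=[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def sniff_meta_charset(body, scan_bytes=1024):
    """Return the charset declared in a <meta> tag near the start, or None."""
    match = _META_CHARSET.search(body[:scan_bytes])
    return match.group(1).decode("ascii").lower() if match else None


def decode_html(body, header_charset=None, default="utf-8"):
    """Decode using the header charset, then a <meta> declaration, then default."""
    charset = header_charset or sniff_meta_charset(body) or default
    try:
        return body.decode(charset, errors="replace")
    except LookupError:
        # The declared name is not an encoding Python knows about.
        return body.decode(default, errors="replace")
```

With `urlopen()`, `header_charset` would be the value of `resource.headers.get_content_charset()`, which is `None` when the server sends no charset.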

score 66 · Accepted Answer · answered Feb 13 '11 at 02:09

You just need to decode the response, using the charset given in the Content-Type header (typically its last value). There is an example in the tutorial too.

output = response.read().decode('utf-8')  # response is the object returned by urlopen()

Senthil Kumaran

What if the charset is not utf-8? Would it be a better idea to somehow determine it from the response instead of hard-coding this assumption? – Elias Zamaria Jun 23 '14 at 17:56
The `Content-Type` header on the response includes the `charset` value, which is what you need to properly decode the response (at least, before [guessing](https://blog.whatwg.org/the-road-to-html-5-character-encoding) `utf-8`). For example: `Content-Type: text/html; charset=utf-8` – Dolph Sep 19 '18 at 21:04
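
For a header value like the one in the comment above, the standard library can extract the `charset` parameter without hand-written string splitting. This is a small sketch: the helper name `charset_from_content_type` and the UTF-8 default are assumptions for illustration, while `email.message.Message.get_content_charset()` is the same method the response headers from `urlopen()` expose.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # failobj is returned when the header carries no charset parameter.
    return msg.get_content_charset(failobj=default)


print(charset_from_content_type("text/html; charset=utf-8"))
print(charset_from_content_type("text/html"))
```

The returned name is lower-cased, so it can be passed straight to `bytes.decode()`.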

score 10 · Answer 3 · edited Nov 17 '15 at 12:50

I had the same issues for the last two days. I finally have a solution. I'm using the info() method of the object returned by urlopen():

import urllib.request

req = urllib.request.urlopen(URL)
charset = req.info().get_content_charset()
content = req.read().decode(charset)

Glenn

answered Nov 17 '15 at 12:41

pytohs

This is exactly the same answer that Ivan Klass posted 2 years before, except using `info` instead of `headers`. :-/ With no explanation of why to pick this one over that one, this answer looks like a duplicate to me. – msb Dec 29 '18 at 01:18
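
A quick way to check the point made in this comment is to compare the two directly. The sketch below uses a `data:` URL purely so that it runs without network access; that URL and its content are assumptions for the example, and the same attributes exist on the response returned for an http(s) URL.

```python
import urllib.request

# A data: URL keeps this sketch offline; the payload is percent-encoded UTF-8.
url = "data:text/plain;charset=utf-8,caf%C3%A9"

with urllib.request.urlopen(url) as resp:
    # info() simply returns the headers attribute.
    same_object = resp.info() is resp.headers
    charset = resp.headers.get_content_charset()
    content = resp.read().decode(charset)

print(same_object, charset, content)
```

So `resp.info().get_content_charset()` and `resp.headers.get_content_charset()` are interchangeable.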

score 6 · Answer 4 · edited May 24 '16 at 09:44

With requests:

import requests

response = requests.get(URL).text

answered Apr 28 '16 at 09:18

xged

This is using a different library entirely. – AMC Jul 22 '20 at 21:11

score 5 · Answer 5 · answered Dec 13 '19 at 06:18

Here is a simple example HTTP request (tested and working):

import urllib.request

address = "http://stackoverflow.com"
urllib.request.urlopen(address).read().decode('utf-8')

Make sure to read the documentation:

https://docs.python.org/3/library/urllib.request.html

If you want a more detailed GET/POST request:

import urllib.request
# HTTP REQUEST of some address
def REQUEST(address):
    req = urllib.request.Request(address)
    req.add_header('User-Agent', 'NAME (Linux/MacOS; FROM, USA)')
    response = urllib.request.urlopen(req)
    html = response.read().decode('utf-8')  # decode bytes to text; assumes the page is UTF-8
    print("REQUEST (ONLINE): " + address)
    return html

Does this not have the same issue as the accepted answer? To quote a comment from there: _What if the charset is not utf-8? Would it be a better idea to somehow determine it from the response instead of hard-coding this assumption?_ — AMC, Jul 22 '20 at 21:12
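
One possible way to address that concern is to ask the response for its declared charset and only fall back to UTF-8 when none is sent. This is a sketch under assumptions, not a drop-in replacement: the function name `fetch_text`, the placeholder User-Agent string, and the choice of `errors="replace"` are all made up for illustration.

```python
import urllib.request


def fetch_text(address, default_charset="utf-8"):
    """Fetch address and decode the body using the charset the server declares."""
    req = urllib.request.Request(address)
    req.add_header("User-Agent", "example-client/0.1")  # placeholder value
    with urllib.request.urlopen(req) as response:
        # Use the declared charset; fall back to the default if none is sent.
        charset = response.headers.get_content_charset(failobj=default_charset)
        # errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
        return response.read().decode(charset, errors="replace")
```

Usage would look like `html = fetch_text("http://stackoverflow.com")`. If the server declares an encoding name Python does not recognise, `decode()` raises `LookupError`, which this sketch does not handle.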

score 1 · Answer 6 · edited Dec 01 '11 at 17:08

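In Python 2:

urllib.urlopen(url).headers.getheader('Content-Type')

In Python 3, the equivalent is:

urllib.request.urlopen(url).headers.get('Content-Type')

Either will output something like this: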

text/html; charset=utf-8

Brian Deragon

answered Dec 01 '11 at 16:48

wynemo

score -3 · Answer 7 · answered Feb 13 '11 at 02:09

After you make a request with req = urllib.request.urlopen(...), you have to read the response by calling html_string = req.read(). That will give you the string response, which you can then parse the way you want.

Jesse Cohen

I do, that's how I get it, but it returns a bytestream, b'...'. – kryptobs2000 Feb 13 '11 at 02:10
I see, then you can use `.decode()` as @Senthil pointed out, or you can use urllib2, which should handle this transparently for you. – Jesse Cohen Feb 13 '11 at 02:13
