I am running this code using Python 3.2.3:

import re

regex = '<title>(.+?)</title>'
pattern = re.compile(regex)

and then searching the pattern using findall:

titles = pattern.findall(html)
print(titles)

The html object holds the HTML read from a specific URL:

html = response.read()

I get the error "can't use a string pattern on a bytes-like object". I have tried using:

regex = b'<title>(.+?)</title>'

but then each result is printed with a "b" prefix. How do I get plain strings? Thanks.

Nikhil

1 Answer

urllib.request responses give you bytes, not Unicode strings. That's why the re pattern needs to be a bytes object too, and why you get bytes results back.
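As a minimal sketch of that behaviour, the snippet below uses a hard-coded bytes literal in place of a real response body (the sample HTML is an assumption for illustration only). It shows that a bytes pattern yields bytes matches, and that decoding a match gives an ordinary string without the b prefix:

```python
import re

# Stand-in for response.read(); a real response body is also a bytes object.
html = b"<html><head><title>Example page</title></head></html>"

# A bytes pattern is required to search bytes, and the matches are bytes too.
pattern = re.compile(b"<title>(.+?)</title>")
titles = pattern.findall(html)
print(titles)  # [b'Example page']

# Decoding each match turns it into a str, which prints without the b prefix.
decoded_titles = [t.decode("utf-8") for t in titles]
print(decoded_titles)  # ['Example page']
```

The UTF-8 codec here is assumed; with real data the right codec depends on what the server sent.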

You can decode the response using the encoding the server gave you in the HTTP headers:

html = response.read()
# no codec set? We default to UTF-8 instead, a reasonable assumption
codec = response.info().get_param('charset', 'utf8')
html = html.decode(codec)

Now you have a Unicode string and can use a Unicode (str) regular expression pattern too.
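Putting the pieces together might look like the sketch below. The helper name extract_titles is hypothetical, and the headers and body are simulated with email.message.Message (the base class of the object that response.info() returns) so the example runs without a network connection:

```python
import re
from email.message import Message


def extract_titles(body, headers):
    """Decode a bytes body with the charset from the headers, then find titles."""
    # Fall back to UTF-8 when the Content-Type header names no charset.
    codec = headers.get_param("charset", "utf8")
    text = body.decode(codec)
    # text is a str, so a str pattern works and the matches are str as well.
    return re.findall(r"<title>(.+?)</title>", text)


# Simulated response parts (assumed sample data, not a real server reply).
headers = Message()
headers["Content-Type"] = "text/html; charset=iso-8859-1"
body = "<title>Caf\u00e9 menu</title>".encode("iso-8859-1")

titles = extract_titles(body, headers)
print(titles)
```

With a real request you would pass response.read() and response.info() as the two arguments.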

The above can still raise UnicodeDecodeError if the server lied about the encoding, or if no encoding was set and the UTF-8 default turned out to be wrong.
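One possible way to cope with that, sketched under the assumption that losing a few undecodable characters is acceptable: try the declared codec, then UTF-8, and finally decode with errors="replace" so bad bytes become U+FFFD instead of raising. The helper name decode_body is hypothetical:

```python
def decode_body(body, declared=None):
    """Decode bytes to str, tolerating a wrong or missing declared codec."""
    candidates = [codec for codec in (declared, "utf-8") if codec]
    for codec in candidates:
        try:
            return body.decode(codec)
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: the bytes are not valid in this codec.
            # LookupError: the server named a codec Python does not know.
            continue
    # Last resort: never raises, but marks undecodable bytes with U+FFFD.
    return body.decode("utf-8", errors="replace")


raw = "Caf\u00e9".encode("latin-1")  # b'Caf\xe9' is not valid UTF-8
text = decode_body(raw, declared="utf-8")
print(ascii(text))  # 'Caf\ufffd'
```

Whether silent replacement is appropriate depends on the application; for some uses it is better to let the exception propagate.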

In any case, return values represented as b'...' are bytes objects: raw data that has not yet been decoded to Unicode. They are nothing to worry about, and you can decode them as long as you know the right encoding of the data.

Martijn Pieters
  • This typifies a general rule when reading and writing string data: decode your inputs to Unicode when you read them, encode your Unicode strings before you write them. All text inside your program should be handled in Unicode. – holdenweb Feb 27 '14 at 23:24