use urllib to search source code

Question

I'm trying to write a script that will search for text in a website's source code. It successfully grabs the source code and prints it out, and the output looks something like: b'<?xml version="1.0" encoding="UTF-8" ?>\n<!DOCTYPE html ... and so on

However, when I try to find the 'div' tags in the code using print(page.find('div')), I get the error TypeError: Type str doesn't support the buffer API. I believe this is because I am receiving a byte literal. How do I decode this as UTF-8 or ASCII so that I can search for a string?

If needed, here is the simple code I am running:

import urllib.request
from urllib.error import URLError

def get_page(url):
  #make the request
  req = urllib.request.Request(url)
  the_page = urllib.request.urlopen(req)

  #get the results of the request
  try:
    #read the page
    page = the_page.read()
    print(page)
    print(page.find('div'))

  #except error
  except URLError as e:
    #if error has a reason (thus is url error) print the reason
    if hasattr(e, 'reason'):
      print(e.reason)
    #if error has a code (thus is HTTP error) print the code and the error
    if hasattr(e, 'code'):
      print(e.code)
      print(e.read())

Why don't you just parse the xml instead of treating the page as a string? — Diego Basch, Jan 14 '13 at 17:43
@CRUSADER: That's sort of a rude and unconstructive reaction. — David Robinson, Jan 14 '13 at 17:46
@CRUSADER why is it bad to parse html with a string operation? — Hat, Jan 14 '13 at 18:05
@flapjacks [It isn't always.](http://stackoverflow.com/a/1733489/530160) — Nick ODell, Jan 14 '13 at 18:17

score 0 · Answer 1 · answered Jan 15 '13 at 12:04

I assume you are using Python 3 (judging by print being used as a function, not a statement).

In Python 3, read() returns a bytes object, so page is bytes, not str. You therefore need to search it with a bytes object too. Try this:

print(page.find(b'div'))
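If you would rather work with a str, as the question asks, you can decode the bytes first. Here is a minimal sketch of both approaches. The sample page value is a hypothetical stand-in for what the_page.read() returns, and UTF-8 is an assumption based on the encoding declared in the question's output:

```python
# Hypothetical sample standing in for the_page.read(); no network access needed.
page = b'<?xml version="1.0" encoding="UTF-8" ?>\n<html><body><div>hi</div></body></html>'

# Option 1: search the bytes directly, using a bytes pattern.
bytes_index = page.find(b'div')

# Option 2: decode to str first (assuming the page is UTF-8),
# then search with an ordinary str pattern.
text = page.decode('utf-8')
str_index = text.find('div')

print(bytes_index, str_index)
```

For a real response, the charset may differ from UTF-8; the_page.headers.get_content_charset() should report what the server declared, if it declared anything.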

Hope this helps.

ldfa
