This is the code for a web crawler. I'm a beginner learning Python, so I don't know how to fix it. Something seems to be wrong with the call to search().

# -*- coding:utf-8 -*-
import urllib,urllib2,re
class BDTB:
    def __init__(self,baseUrl,seeLZ):
        self.baseUrl = baseUrl
        self.seeLZ = '?see_lz' + str(seeLZ)
    def getPage(self,pageNum):
        try:
            url = self.baseUrl + self.seeLZ + '&pn=' + str(pageNum)
            request = urllib2.Request(url)
            response = urllib2.urlopen(request)
            #print response.read().decode('utf-8')
            return response
        except urllib2.URLError,e:
            if hasattr(e,'reason'):
                print u'连接百度贴吧失败,错误原因',e.reason
                return None
    def getTitle(self):
        page = self.getPage(1)
        pattern = re.compile('<h3 class.*?px">(.*?)</h3>',re.S)
        result = re.search(pattern,page)
        if result:
            print result.group(1)
            return result.group(1).strip()
        else:
            return None
baseURL = 'http://tieba.baidu.com/p/4095047339'
bdtb = BDTB(baseURL,1)
bdtb.getTitle()
孙弘达
  • It would be very helpful to post the error :) – WoodChopper Oct 31 '15 at 15:04
  • Why are you trying to parse HTML with regex? Learn more about [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/). – Remi Guan Oct 31 '15 at 15:16
  • Or use pyquery. It's not as comprehensive as BeautifulSoup, but its jQuery-style selectors mean you can test them in the browser first. Still, there's no reason to downvote a reasonable question by a noob who did the work on it first. +1 folks, chill on the -1s – JL Peyret Oct 31 '15 at 15:37

1 Answer

This raises TypeError: expected string or buffer, because getPage() returns the response object from urllib2.urlopen(request) and getTitle() passes that object straight to re.search(), which requires a str.

If you change the return statement in getPage() from:

return response  # returns the response object

to one that returns the text contained in the response:

return response.read()  # returns the text contained in the response

the script works and, when executed, prints:

广告兼职及二手物品交易集中贴
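
To see the difference without hitting the network, here is a minimal sketch. It assumes an io.BytesIO object as a stand-in for the object returned by urllib2.urlopen() (both are file-like and expose .read()); the HTML snippet and the extract_title helper are hypothetical and only there for illustration. The regular expression is the one from the question.

```python
# -*- coding: utf-8 -*-
import io
import re

# Hypothetical page content; the real page would come from urllib2.urlopen().
html = u'<h3 class="core_title_txt" style="width: 396px">Example title</h3>'

# Stand-in for the response object: file-like, .read() returns raw bytes.
fake_response = io.BytesIO(html.encode('utf-8'))

# Same pattern as in the question.
pattern = re.compile(u'<h3 class.*?px">(.*?)</h3>', re.S)


def extract_title(response):
    """Read and decode the body first, then search the resulting text."""
    text = response.read().decode('utf-8')
    match = pattern.search(text)
    return match.group(1).strip() if match else None


# Passing the response object itself reproduces the error from the question.
try:
    pattern.search(fake_response)
    raised = False
except TypeError:
    raised = True

title = extract_title(fake_response)
print('TypeError raised: %s, title: %s' % (raised, title))
```

Decoding the bytes explicitly is optional in Python 2, where read() already returns a str, but it keeps the sketch valid on Python 3 as well.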

Additionally, since you're working with Python 2.x, you might want to change your class definition from class BDTB: to class BDTB(object): in order to use new-style classes.
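
As a rough illustration, the sketch below uses a trimmed-down version of the class (only the baseUrl argument is kept, which is a simplification for this example). For a new-style class, type() of an instance is the class itself; for an old-style Python 2 class it would be the generic instance type instead. In Python 3 every class is new-style, so the explicit object base is harmless there.

```python
class BDTB(object):  # inheriting from object makes this new-style in Python 2
    def __init__(self, baseUrl):
        self.baseUrl = baseUrl


bdtb = BDTB('http://tieba.baidu.com/p/4095047339')

# True for new-style classes: the instance's type is the class itself.
is_new_style = type(bdtb) is BDTB
print(is_new_style)
```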

Dimitris Fasarakis Hilliard