import re
import requests
import bs4
import datetime
from urllib import quote
import urllib2, cookielib

class smzdm(object):
    def __init__(self):
        pass

    def getoff(self,keyword):

        headers={'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding':'gzip, deflate, sdch',
        'Accept-Language':'zh-CN,zh;q=0.8,en;q=0.6,zh-TW;q=0.4',
        'Cache-Control':'max-age=0',
        'Connection':'keep-alive',
        'Cookie':'''smzdm_user_source=0976117114331F28EB6D3C3979605D97; __gads=ID=7f857c698119a4fb:T=1473839452:S=ALNI_MYdWGclXArRQSMaAa_ReL0LFFJBfg; __jsluid=1676c2e0d0b8dcf7a752595a8db32ed6; smzdm_wordpress_360d4e510beef4fe51293184b8908074=user%3A5933507146%7C1483103361%7C608649ec42bc38522a6e17f8cd013d9b; smzdm_wordpress_logged_in_360d4e510beef4fe51293184b8908074=user%3A5933507146%7C1483103361%7Ceb944dea579f5e1e0318bef13f2cda21; user-role-smzdm=subscriber; sess=ZGQyYmF8MTQ4MzEwMzM2MXw1OTMzNTA3MTQ2fDlmYWYxYTQxNzIzODliNmI3M2VlY2Q2MzYyM2IwYjYz; user=user%3A5933507146%7C5933507146; PHPSESSID=ecndosaes8t2ddum97c5h60t62; wt3_eid=%3B999768690672041%7C2147461801100790519%232148117884200550731; wt3_sid=%3B999768690672041; smzdm_user_view=453FB6F62468F7F6C548F0B86926372B; crtg_rta=criteo_D_728*90%3D1%3Bcriteo_300600zy03%3D1%3Bcriteo_300250zy02%3D1%3Bcriteo_300250zy01%3D1%3Bcriteo_300250newzy02%3D1%3Bcriteo_300600newzy11%3D1%3B; s_his=%E8%80%B3%E6%B8%A9%E6%9E%AA%2C%E6%96%B0%E5%AE%89%E6%80%A1%20%E5%90%B8%E5%A5%B6%E5%99%A8%2CSCF332%2F01%2C%E6%B8%A9%E5%A5%B6%E5%99%A8%2C%E5%A5%B6%E7%93%B6%E6%B6%88%E6%AF%92%2C%E5%A4%A7%E7%8E%8B%E5%A4%A9%E4%BD%BF%2C%E5%90%B8%E5%A5%B6%E5%99%A8; Hm_lvt_9b7ac3d38f30fe89ff0b8a0546904e58=1480035459; Hm_lpvt_9b7ac3d38f30fe89ff0b8a0546904e58=1481250397; _ga=GA1.2.1138807399.1473839451; _gat_UA-27058866-1=1; amvid=f192af9ebab282d29dc88438c918d2af''',
        'Host':'search.smzdm.com',
        'Upgrade-Insecure-Requests':'1',
        'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36'}

        url='http://search.smzdm.com/?c=faxian&s=%s'%keyword

        req=requests.get(url,headers=headers)       
        soup = bs4.BeautifulSoup(req.text, 'lxml')

        soupstr=str(soup)

        start=datetime.datetime.now()

        result=re.findall(r'<a class="z-btn z-btn-red".*? href="(.*?)" onclick="dataLayer.push\((.*?)\)" target="_blank">直达链接</a>.*?  (\d{2}:\d{2})',soupstr,re.S)  #No.1

        #result=re.findall(r'<a class="z-btn z-btn-red".*? href="(.*?)" onclick="dataLayer.push\((.*?)\)" target="_blank">直达链接</a>.*? (\d{2}:\d{2})',soupstr,re.S)  #No.2
        ctime=datetime.datetime.now()-start

        print 'keyword %s has %s results costs %s' %(keyword,len(result),ctime)
        return result

if __name__ == '__main__':
    sm=smzdm()
    sm.getoff('philips')

This code searches smzdm.com (a Chinese e-commerce website) for my keywords and returns today's promotion information. The No.2 re.findall runs in about 1 ms, but the No.1 re.findall takes almost 3 minutes. The fewer results it matches, the longer it runs. The only difference between them is that No.1 has one more space than No.2 between </a>.*? and (\d{2}:\d{2}). Why does this happen, and how can I optimize my code? Thank you.

mufubin

2 Answers

2

You should search the BeautifulSoup tree (soup) for the correct <a> tag, for example:

buttons = soup.findAll("a", {"class": "z-btn-red"}, text="直达链接")

and go from there onwards.
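To illustrate what "going from there" might look like, here is a minimal sketch written for Python 3. It assumes the time stamp sits somewhere inside the same parent element as the button. The real page layout is not shown in the question, so `SAMPLE_HTML`, the `time` span, the example.com URLs, and the helper `extract_deals` are all hypothetical. It uses the `string=` argument, which is the newer name for `text=` in bs4.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: the real smzdm page structure is an assumption here.
SAMPLE_HTML = """
<div class="item">
  <a class="z-btn z-btn-red" href="http://example.com/go/1" onclick="dataLayer.push({'id': 1})" target="_blank">直达链接</a>
  <span class="time">  12:51</span>
</div>
<div class="item">
  <a class="z-btn z-btn-red" href="http://example.com/go/2" onclick="dataLayer.push({'id': 2})" target="_blank">直达链接</a>
  <span class="time">yesterday</span>
</div>
"""

# A short regex applied to a small piece of text, not to the whole page.
TIME_RE = re.compile(r"\b(\d{2}:\d{2})\b")


def extract_deals(html):
    """Return (href, onclick, HH:MM) for each button whose container shows a time."""
    soup = BeautifulSoup(html, "html.parser")
    deals = []
    for link in soup.find_all("a", class_="z-btn-red", string="直达链接"):
        # Assumption: the time stamp lives in the button's parent element.
        container_text = link.parent.get_text(" ")
        match = TIME_RE.search(container_text)
        if match:
            deals.append((link["href"], link.get("onclick", ""), match.group(1)))
    return deals


if __name__ == "__main__":
    for deal in extract_deals(SAMPLE_HTML):
        print(deal)
```

Because the regex only ever sees the text of one small container, it cannot wander across the whole document the way a page-wide `.*?` pattern can.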

Daniel
  • The information I want to capture is a bit more complex than what your code captures, but now I know how to optimize my code. I need to learn more about bs4. – mufubin Dec 11 '16 at 12:51
0

You simply should not parse HTML with regular expressions.

Please refer to one of the most upvoted answers on Stack Overflow.

You should use ElementTree or something similar. Running a regex over a long string is very resource-intensive, especially when several lazy .*? groups have to backtrack because the end of the pattern fails to match, hence the slow execution. In any case, you will eventually (almost certainly) get wrong results, since regular expressions only work on regular languages, which are generated by type-3 grammars (and HTML is not one of them).
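As one possible reading of "something similar", the sketch below uses Python 3's standard-library `html.parser.HTMLParser` to collect the button links without any page-wide regex. The class `DealLinkParser`, the helper `extract_links`, and the sample markup are hypothetical and only stand in for the real page, whose structure is not shown in the question.

```python
from html.parser import HTMLParser


class DealLinkParser(HTMLParser):
    """Collect (href, onclick) for <a> tags with class z-btn-red and text 直达链接."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._pending = None  # attributes of the <a> tag currently open, if any
        self._text = []       # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if "z-btn-red" in classes:
            self._pending = attr
            self._text = []

    def handle_data(self, data):
        if self._pending is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._pending is not None:
            if "".join(self._text).strip() == "直达链接":
                self.links.append(
                    (self._pending.get("href"), self._pending.get("onclick"))
                )
            self._pending = None


def extract_links(html):
    parser = DealLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical markup used only to exercise the parser.
SAMPLE = """
<a class="z-btn z-btn-red" href="http://example.com/go/1" onclick="dataLayer.push({'id': 1})" target="_blank">直达链接</a>
<a class="z-btn z-btn-gray" href="http://example.com/other">other</a>
<a class="z-btn z-btn-red" href="http://example.com/go/2">详情</a>
"""

if __name__ == "__main__":
    print(extract_links(SAMPLE))
```

The parser makes a single pass over the input, so its running time grows with the size of the page rather than with the number of ways a pattern can fail to match.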

Community
Luis Sieira