
I am trying to generate traffic on the network by opening a large list of sites read from a text file.

For each entry I want to fetch the site, collect all of its href links, visit those links and then the site itself, and then move on to the next site in the text file.

The problem I have noticed is that these statements take a while to execute, upwards of 5 seconds per curl. Is this because of my excessive use of try/except blocks? I'm just trying to understand where the problem may be. Sample output (note the roughly five-second gap between requests):

2018-03-14 16:30:32.590135

http://www.ipostparcels.com/parcel-delivery/amazon-parcel-delivery

2018-03-14 16:30:37.653522

http://www.ipostparcels.com/parcel-delivery/abot-ipostparcels

2018-03-14 16:30:42.716842

http://www.ipostparcels.com/parcel-delivery/parcel-delivery-rates

2018-03-14 16:30:47.762127

http://www.ipostparcels.com/parcel-delivery/parcel-collection-and-delivery

2018-03-14 16:30:52.809792

http://www.ipostparcels.com/parcel-delivery/post-for-a-post

2018-03-14 16:30:57.876936

http://www.ipostparcels.com/parcel-delivery/discont-codes-and-offers

2018-03-14 16:31:02.947123

http://www.ipostparcels.com/corier/ebay-corier-service

#!/usr/bin/python
from bs4 import BeautifulSoup
import urllib2
import pycurl
from io import BytesIO
import os
import re
import sys
import random
from datetime import datetime

links = []

while True:
    with open("topdomains3.txt", "r") as f:
        domains = list(f)
        # pick a random starting index into the domain list (randint is inclusive at both ends)
        joker = random.randint(0, len(domains) - 1)
        for i in domains[joker:]:
            i = i.replace("\n", "")
            i = i.replace("None", "")
            i = i.rstrip()
            print i
            try:
                c = pycurl.Curl()
                c.setopt(c.URL, i)
                c.setopt(pycurl.TIMEOUT, 3)
                c.setopt(c.FOLLOWLOCATION, True)
                c.setopt(c.MAXREDIRS, 5)
                try:
                    # fetch the page a second time with urllib2 so BeautifulSoup can parse it
                    i = 'http://' + i
                    html_page = urllib2.urlopen(i)
                    soup = BeautifulSoup(html_page, 'html5lib')
                except Exception, e:
                    print e
                    continue
                # collect every absolute href on the page
                for link in soup.findAll('a', attrs={'href': re.compile("^http")}):
                    links.append(link.get('href'))  # the href is already a plain string, no u'' prefix to strip
                # curl each collected link
                for a in links:
                    try:
                        print "----------------------------------------------------------"
                        print str(datetime.now())
                        print a
                        d = pycurl.Curl()
                        #d.setopt(d.VERBOSE, True)
                        d.setopt(d.URL, str(a))
                        #d.setopt(d.WRITEDATA, buffer)
                        d.setopt(d.TIMEOUT, 3)
                        d.setopt(d.FOLLOWLOCATION, True)
                        d.setopt(d.MAXREDIRS, 5)
                        #d.setopt(pycurl.WRITEFUNCTION, lambda x: None)
                        d.perform()
                        d.close()
                    except pycurl.error:
                        continue
                # finally curl the domain itself
                c.perform()
                c.close()
            except pycurl.error:
                continue
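
To narrow down where each request spends its time, one thing I could do is read libcurl's per-phase timers after a transfer. A minimal sketch (timed_fetch is just a helper name I made up, http://example.com is a placeholder URL, and the response body is discarded with a write callback as in the commented-out line above):

import pycurl

def timed_fetch(url):
    # same options as in the script above
    c = pycurl.Curl()
    c.setopt(c.URL, url)
    c.setopt(c.TIMEOUT, 3)
    c.setopt(c.FOLLOWLOCATION, True)
    c.setopt(c.MAXREDIRS, 5)
    c.setopt(c.WRITEFUNCTION, lambda data: None)  # discard the body
    c.perform()
    # libcurl's timers, in seconds measured from the start of the transfer
    print "dns lookup:   %.3f" % c.getinfo(pycurl.NAMELOOKUP_TIME)
    print "connect:      %.3f" % c.getinfo(pycurl.CONNECT_TIME)
    print "first byte:   %.3f" % c.getinfo(pycurl.STARTTRANSFER_TIME)
    print "total:        %.3f" % c.getinfo(pycurl.TOTAL_TIME)
    c.close()

timed_fetch("http://example.com")

Since the counters are cumulative from the start of the transfer, comparing them should show whether the time goes to DNS, connecting, or waiting for the server rather than to the surrounding Python code.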

Any assistance would be appreciated.

  • What's the reason for using PyCurl over something like requests? – G_M Mar 14 '18 at 21:20
  • I had just heard that pycurl was faster for getting the result: https://stackoverflow.com/questions/15461995/python-requests-vs-pycurl-performance – Spyderz Mar 14 '18 at 21:33
  • The urllib2 section is just for Beautiful Soup to grab the href links on the page. The script as a whole is just for traffic generation, so for the most part I don't care about the response from the web server. – Spyderz Mar 14 '18 at 22:51
  • I am trying to fill up an HTTP traffic logging device. It's not necessarily pycurl that takes 5 seconds; each iteration of the for loop just takes about 5 seconds. Since pycurl is faster, I have it do most of the requests and urllib only gets the href links. – Spyderz Mar 14 '18 at 23:10
  • Have you thought about using [`Scrapy`](https://scrapy.org/)? It's asynchronous and would probably be able to send a lot more traffic and grab the links for you too. Or do your requests have to be one after the other (synchronous)? – G_M Mar 14 '18 at 23:11
  • I found and considered Scrapy once I had most of the script done. I didn't know it was async. Can you use async with Scrapy on Python 2.7, or is that exclusive to 3.4+? – Spyderz Mar 14 '18 at 23:18
  • Scrapy works on both 2 & 3 (I think it uses twisted for async). Yeah, [2.7 and 3.4+](https://docs.scrapy.org/en/latest/intro/install.html#installing-scrapy) – G_M Mar 14 '18 at 23:20
  • Thanks, I'll have to try it out. Async would definitely help generate traffic faster. Do you have any links to example code that would help me out in my situation? (A minimal sketch follows after this thread.) – Spyderz Mar 14 '18 at 23:26
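
A minimal, untested sketch of the Scrapy idea from the comments above. The spider name, the concurrency value, and the file name traffic_spider.py are my own placeholders; it reuses the topdomains3.txt list from the script:

import scrapy


class TrafficSpider(scrapy.Spider):
    # crawl every domain in topdomains3.txt and request each absolute link it finds,
    # letting Scrapy's asynchronous engine overlap the requests
    name = "traffic"
    custom_settings = {
        "DOWNLOAD_TIMEOUT": 3,
        "REDIRECT_MAX_TIMES": 5,
        "CONCURRENT_REQUESTS": 32,  # placeholder value, tune as needed
    }

    def start_requests(self):
        with open("topdomains3.txt") as f:
            for line in f:
                domain = line.strip()
                if domain and domain != "None":
                    yield scrapy.Request("http://" + domain, callback=self.parse)

    def parse(self, response):
        # follow every absolute href on the page; the link responses themselves are ignored
        for href in response.xpath("//a/@href").extract():
            if href.startswith("http"):
                yield scrapy.Request(href, callback=self.ignore)

    def ignore(self, response):
        pass

It should run without creating a full project via: scrapy runspider traffic_spider.py. Failed requests are logged by Scrapy and the crawl continues, so the per-request try/except blocks shouldn't be needed.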
