
I'm very new to the world of web scraping. I've written some code that scrapes the Indiegogo site when I manually download each page, and that works great.

However, when I try to automate the script to fetch the page source for my list of URLs, it fails no matter what method I use. I've attached my code to show the method I'm using, and an image of the webpage I get when fetching the page with Selenium. The site seems to reject any request that doesn't come from me manually pointing my browser at each individual page. When I use requests to retrieve a page, it comes back with a 416 (Requested Range Not Satisfiable) error.
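For reference, this is a minimal version of the requests attempt. Even sending a browser-like User-Agent (the string below is just an example I tried, and the URL is a placeholder) doesn't change the outcome:

import requests

test_url = 'https://www.indiegogo.com/projects/example-campaign'  #placeholder url
request_headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36'}

res = requests.get(test_url, headers=request_headers)
print res.status_code  #prints 416 for every url I try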

Any help would be massively appreciated.

from bs4 import BeautifulSoup
import requests
import csv
import re
import time
from selenium import webdriver

#open list of unique indiegogo urls
f = open('urls.dat')
urls = [url.strip() for url in f.readlines()]
f.close()

#prepare csv output file for writing results to
resultFile = open("output.csv", 'wb')
csv_headers = ["id", "url", "category", "created", "ends", "country", "currency", "goal", "funds", "funders"]
wr = csv.writer(resultFile, dialect='excel')
wr.writerow(csv_headers)

#headers sent with each request (kept separate from the csv header row above)
request_headers = {'User-Agent': 'Mozilla/5.0'}

#chromedriver = ...PATH_TO_YOUR_CHROMEDRIVER
#driver = webdriver.Chrome(chromedriver)
#driver = webdriver.Firefox()
#if using selenium, uncomment relevant webdriver

def getUrl(urls):
    #the parameters we want, one list per field, filled in as each page is parsed
    country = []
    currency = []
    category = []
    ids = []  #renamed from id to avoid shadowing the builtin
    funders = []
    funds = []
    created = []
    ends = []
    goal = []

    for url in urls:
        print url

        res = requests.get(url, headers=request_headers)
        time.sleep(10)  #throttle between requests
        soup = BeautifulSoup(res.text, 'html.parser')

        #find the CDATA section where the parameters are listed
        cdata = soup.find(text=re.compile("CDATA"))
        #print cdata, len(cdata)

getUrl(urls)

The image is what I get when I use Selenium to crawl the site. Each page should pop up as Selenium grabs its HTML, but no matter what delay I use, the result is the same.
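For completeness, the Selenium path is roughly the sketch below (assuming chromedriver is on your PATH if you use Chrome); the sleep is the delay I've been varying:

from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Firefox()  #or webdriver.Chrome(chromedriver)
for url in urls:
    driver.get(url)
    time.sleep(10)  #delay before reading the source; changing this makes no difference
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    #...same CDATA parsing as above...
driver.quit()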

  • Yeah, that's Distil Networks detecting the selenium-powered browser, see more at http://stackoverflow.com/a/33403473/771848. – alecxe Jan 19 '16 at 04:42
  • Damn, that's what I thought. It looks as though it'll be very tough to automate any sort of request in that case. Thanks for pointing me in the right direction though. – John Burton Jan 20 '16 at 06:06
