
I have written a script that scrapes a URL. It works fine on Linux, but when I run it on Windows 7 I get an HTTP 503 error, as if the URL had some issue. I am using Python 2.7.11. Please help. Below is the script:

import sys      # used to add the BeautifulSoup folder to the import path
import urllib2  # used to read the HTML document

if __name__ == "__main__":
    ### Import Beautiful Soup
    ### Here, the BeautifulSoup folder sits at the same level as this
    ### Python script, so Python needs to be told where to look.
    sys.path.append("./BeautifulSoup")
    from bs4 import BeautifulSoup

    ### Create an opener with a Google-friendly user agent
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]

    ### Open the output file once, before the loop, so every page
    ### appends to it instead of truncating it on each iteration
    out = open("parseddata.txt", "w")

    ### Open each page & generate soup.
    ### The "start" variable iterates through the result pages,
    ### 10 results per page.
    for start in range(0, 1000):
        url = "http://www.google.com/search?q=site:theknot.com/us/&start=" + str(start * 10)
        page = opener.open(url)
        soup = BeautifulSoup(page)

        ### Parse and find.
        ### Google keeps result URLs in <cite> tags, so for each <cite>
        ### tag on the page, print its contents and save it to the file.
        for cite in soup.findAll('cite'):
            print cite.text
            out.write(cite.text.encode("utf-8") + "\n")  # encode for the text file

    out.close()

When you run it on Windows 7, cmd throws an HTTP 503 error stating the issue is with the URL, yet the same URL works fine on Linux. If the URL is actually wrong, please suggest alternatives.

  • That's unrelated, but did you know that instead of generating 10 requests you can add `&num=100` to the end of your search URL and get 100 results at once? (See the sketch after these comments.) – spectras Jun 21 '16 at 10:35
  • Google may be blocking your IP because of too many connections. – Barmar Jun 21 '16 at 10:35
  • Actually, a very good point, @Barmar. It's true that automated queries are against Google's Terms of Service, and they may end up blocking you. It's odd that it would consistently happen from his Windows box and not from his Linux box, though. – spectras Jun 21 '16 at 10:38
  • @spectras Maybe he just hasn't done it enough from the Linux box to get noticed. – Barmar Jun 21 '16 at 10:40
  • @spectras @Barmar OK, too many connections? Is there any way to confirm that? – Shailendra Baranwal Jun 21 '16 at 10:41
  • @ShailendraBaranwal If you have access to a Windows 7 box on another IP, you could try it there. Also try reducing the size of the range so you don't open 1,000 connections in rapid fire. – Barmar Jun 21 '16 at 10:43
  • Turn off the Windows firewall and test again? Or check whether the firewall is blocking the Python interpreter. – Mark Evans Jun 21 '16 at 11:01
  • This question is solved. How do I mark it as answered? – Shailendra Baranwal Jun 21 '16 at 11:42
  • @ShailendraBaranwal> you click the tick mark just below the score of the answer that best helped you (there is just one here, but for next time…). When you have 15 rep you'll also be able to upvote all helpful answers. – spectras Jun 21 '16 at 12:11
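
Regarding spectras's &num=100 tip above: a minimal sketch of what that change to the loop might look like, assuming Google honours the (undocumented) num parameter; the reduced loop count and the start stride of 100 follow from fetching 100 results per request, and the opener is the one from the question's script.

### Hedged sketch: fetch 100 results per request instead of 10,
### so the same number of results needs far fewer connections.
for start in range(0, 10):
    url = ("http://www.google.com/search"
           "?q=site:theknot.com/us/"
           "&num=100"
           "&start=" + str(start * 100))
    page = opener.open(url)
    soup = BeautifulSoup(page)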

1 Answer


Apparently, with Python 2.7.2 on Windows, any time you set a custom User-agent header, urllib2 fails to send it (source: https://stackoverflow.com/a/8994498/6479294).

So you might want to consider using requests instead of urllib2 on Windows:

import requests
# ...
page = requests.get(url)
soup = BeautifulSoup(page.text)
# etc...
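
For reference, a fuller sketch of the same loop ported to requests; the explicit User-Agent header and the "html.parser" argument are my additions rather than part of the original script:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}  # same browser-like agent as before
out = open("parseddata.txt", "w")

for start in range(0, 1000):
    url = "http://www.google.com/search?q=site:theknot.com/us/&start=" + str(start * 10)
    page = requests.get(url, headers=headers)       # requests sends the header reliably
    soup = BeautifulSoup(page.text, "html.parser")  # explicit parser avoids bs4 warnings
    for cite in soup.findAll('cite'):
        print cite.text
        out.write(cite.text.encode("utf-8") + "\n")

out.close()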

EDIT: As pointed out in the comments, it's also quite possible that Google is blocking your IP; they don't really like bots making 100-odd requests sequentially.
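
If that is what is happening, pacing the requests and stopping on an error status may help. A minimal sketch, reusing the requests port above (the 2-second delay is an arbitrary choice, not a documented limit):

import time

for start in range(0, 1000):
    url = "http://www.google.com/search?q=site:theknot.com/us/&start=" + str(start * 10)
    page = requests.get(url, headers=headers)
    if page.status_code == 503:
        print "got HTTP 503, stopping before Google blocks the IP outright"
        break
    soup = BeautifulSoup(page.text, "html.parser")
    time.sleep(2)  # pause between requests instead of rapid-fire connections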

– Thom