
I am struggling to get my code, which scrapes HTML table data from the web, to work through a list of websites held in a ShipURL.txt file. The code reads the web page addresses from ShipURL.txt, follows each link, downloads the table data, and saves it to CSV. My problem is that the program cannot finish: the error "A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond" occurs partway through and the program stops. As I understand it, I need to increase the request timeout, use a proxy, or add a try statement. I have scanned through a few answers about the same problem, but as a novice I am finding them hard to understand. Any help would be appreciated.

ShipURL.txt https://dl.dropboxusercontent.com/u/110612863/ShipURL.txt

# -*- coding: utf-8 -*-
import csv
import re
from urllib import urlopen
from bs4 import BeautifulSoup

fm = open('ShipURL.txt', 'r')
Shiplinks = fm.readlines()
fm.close()

with open('ShipData.csv', 'wb') as f:                       # Create the CSV file once, before the loop, so every site's rows are kept
    writer = csv.writer(f)
    for line in Shiplinks:
        website = re.findall(r'(https?://\S+)', line)
        website = "".join(str(x) for x in website)
        if website == "":
            continue
        shipUrl = website
        shipPage = urlopen(shipUrl)

        soup = BeautifulSoup(shipPage, "html.parser")       # Parse the web page HTML
        table = soup.find_all("table", {"class": "table1"}) # Find tables with class table1
        List = []
        columnRow = ""
        valueRow = ""
        Values = []
        for mytable in table:                               # Loop over tables with class table1
            table_body = mytable.find('tbody')              # Find the tbody section in the table
            try:                                            # If tbody exists
                rows = table_body.find_all('tr')            # Find all rows
                for tr in rows:                             # Loop over rows
                    cols = tr.find_all('td')                # Find the cells
                    i = 1                                   # Control variable for the cells
                    for td in cols:                         # Loop over cells
                        if i == 1:                          # First cell: the column header
                            if td.text.endswith(":"):
                                # Strip the colon and append a comma
                                columnRow += td.text.strip(":") + ","
                                List.append(td.text)        # Collect the header text in 'List'
                                i += 1                      # The next cell is treated as the value
                        else:
                            # Strip line breaks and append a comma
                            valueRow += td.text.strip("\n") + ","
                            Values.append(td.text)          # Collect the value in 'Values'
            except AttributeError:                          # table_body is None: no tbody in this table
                print "no tbody"
        # Print the headers and the values, separated by a blank line
        print columnRow.strip(",")
        print "\n"
        print valueRow.strip(",")
        # Write the column headers as the first row and the values as the second
        writer.writerow([columnRow.encode('utf-8')])
        writer.writerow([valueRow.encode('utf-8')])
Gert Lõhmus

3 Answers


I would wrap your urlopen call in a try/except, like this:

try:
    shipPage = urlopen(shipUrl)
except Exception as e:
    print e

That'll at least help you figure out where the error is happening. Without the extra files, it'd be hard to troubleshoot otherwise.

Python errors documentation
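If you want the loop to keep going when one site fails instead of stopping the whole program, a minimal sketch of that idea (shipUrls stands in for the list of addresses built from ShipURL.txt; it is not a name from the question's code):

from urllib import urlopen

for shipUrl in shipUrls:                    # shipUrls: the addresses from ShipURL.txt
    try:
        shipPage = urlopen(shipUrl)
    except Exception as e:                  # connection failed: report it and move on
        print "skipping %s: %s" % (shipUrl, e)
        continue
    # ... parse shipPage as in the question ...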

Eric
  • Also, after you set website, maybe try adding a 'print website' statement, to make sure your urls look right. – Eric Dec 16 '15 at 15:27
  • The link is correct, but I cannot hold the connection. The error seems to be in the line shipPage = urlopen(shipUrl); it can't fetch the website for the 5th, or sometimes the 3rd or 7th, address. I tried using shipPage = urllib2.urlopen(shipUrl, timeout=30). It didn't help. – Gert Lõhmus Dec 16 '15 at 15:51
  • Does it happen on a specific address when you're setting shipPage? Maybe try opening up that url in your browser to see what happens. – Eric Dec 16 '15 at 16:03

Websites protect themselves against DDoS attacks by blocking rapid successive accesses from a single IP.

You should put a sleep between each access, or after every 10, 20, or 50 accesses, as in the sketch below.

Or you may have to anonymize your access through the Tor network or an alternative.
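For example, a minimal sketch of the sleep approach (the two-second pause and the shipUrls name are illustrative assumptions, not part of the original code):

import time
from urllib import urlopen

for shipUrl in shipUrls:    # shipUrls: the addresses read from ShipURL.txt
    shipPage = urlopen(shipUrl)
    # ... process the page ...
    time.sleep(2)           # pause so successive requests don't look like an attack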

Assem

Found some great info at this link: How to retry after exception in Python? It is basically my connection problem, so I decided to retry until it succeeds. At the moment it is working. I solved the problem with this code:

import urllib2

while True:
    try:
        shipPage = urllib2.urlopen(shipUrl, timeout=5)
    except Exception as e:
        continue
    break
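Note that this retries forever if a site never responds. A sketch of a bounded variant (the five-attempt cap and two-second pause are arbitrary choices, not from the linked answer):

import time
import urllib2

shipPage = None
for attempt in range(5):                    # give up after 5 tries
    try:
        shipPage = urllib2.urlopen(shipUrl, timeout=5)
        break                               # success: stop retrying
    except Exception as e:
        print "attempt %d failed: %s" % (attempt + 1, e)
        time.sleep(2)                       # brief pause before the next try
if shipPage is None:
    print "giving up on %s" % shipUrl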

But I do thank everybody here, you helped me understand the problem a lot better!

Gert Lõhmus