
I am trying to download a web page using a Python script with selenium-webdriver, but it keeps throwing a ValueError exception, which causes the downloaded page to be truncated.

The file seems to be truncated whenever certain characters (such as a comma or hyphen) appear on the web page.

The code:

    from pip.cmdoptions import global_options
    from selenium import webdriver
    from pyvirtualdisplay import Display
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    def contactbrowser(httppath,iterater):
        display = Display(visible=0, size=(800, 600))
        display.start()
        driver = webdriver.Firefox()#firefox_profile=fp)
        wd=driver.get(httppath)
        driver.maximize_window()
        try:
            element = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.TAG_NAME, "html"))
            )
            ele=driver.find_element_by_tag_name("h1")
            header1=ele.get_attribute("innerHTML")
            fullpath1=header1
            file = open("1/"+fullpath1+".html", "w")
            for ss in driver.page_source:
                file.write(bytearray([ord(ss)]))
            file.close()
            driver.close()

        except ValueError:
            print "Value error", httppath
            driver.close()
        except TypeError:
            driver.close()
        except:
            driver.close()

    list= []
    fileloc = open("file.txt", "r")
    line = fileloc.readline()
    while line:
        list.append(line)
        line = fileloc.readline()
    fileloc.close()
    count=0
    i=0
    while count<list.__len__():
        contactbrowser(list[count],i)
        count=count+1
        i=i+1

For example, downloading this page resulted in a truncated file.


EDIT: The problem occurs when the script hits a character that has no corresponding ASCII value. In the example above, the word "first" appeared as " first" somewhere in the text, which interrupted the download and left a truncated file.
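To make the EDIT concrete, here is a minimal sketch of why the write loop stops and one possible way around it. `bytearray([ord(ch)])` only accepts values from 0 to 255, so any character with a higher code point raises ValueError partway through the loop. The sketch assumes the offending character was something like the "fi" ligature (U+FB01); that is a guess, since the original character is not shown. The helper names `byte_write_fails` and `save_page_source` and the sample markup are made up for illustration, and writing the page as UTF-8 in one call is only one option, not necessarily the right encoding for every page.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def byte_write_fails(ch):
    """Return True if bytearray([ord(ch)]) raises ValueError for this character."""
    try:
        bytearray([ord(ch)])  # same conversion as in the question's write loop
    except ValueError:
        return True
    return False


def save_page_source(page_source, path):
    """Hypothetical helper: write the whole page as UTF-8 text in a single call."""
    # io.open behaves the same on Python 2 and 3 and expects unicode text
    with io.open(path, "w", encoding="utf-8") as out:
        out.write(page_source)


# Made-up sample standing in for driver.page_source (assumed to be unicode text)
sample = u"<h1>\ufb01rst</h1>"
target = os.path.join(tempfile.mkdtemp(), "page.html")
save_page_source(sample, target)
```

In the script above, this would replace the `for ss in driver.page_source` loop with something like `save_page_source(driver.page_source, "1/" + fullpath1 + ".html")`.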

unknown
  • What is it that your script is supposed to be doing? I'm confused. – JeffC Oct 20 '15 at 15:40
  • It is supposed to download a web page. It isn't meant to save just one page, as in the example I provided; I need the script to download a number of links listed in a text file. – unknown Oct 21 '15 at 06:40
  • Take a look at this question: http://stackoverflow.com/questions/16604162/selenium-download-full-html-page. Read the first answer, about enabling JavaScript. – JeffC Oct 21 '15 at 14:04

0 Answers