
I have this code that iterates through a txt file of URLs and searches for files to download:

import csv
import os
import urlparse
from re import compile
from urllib import urlopen, urlretrieve
from bs4 import BeautifulSoup as bs  #or, for BeautifulSoup 3: from BeautifulSoup import BeautifulSoup as bs

URLS = open("urlfile.txt").readlines()

def downloader():
    with open('data.csv', 'w') as csvfile:
        writer = csv.writer(csvfile)
        for url in URLS:
            try:
                html_data = urlopen(url)
            except:
                print 'Error opening URL: ' + url
                continue  #Skips this URL; otherwise html_data from the previous iteration would be reused.

            #Creates a BS object out of the open URL.
            soup = bs(html_data)
            #Parsing the URL for later use
            urlinfo = urlparse.urlparse(url)
            domain = urlparse.urlunparse((urlinfo.scheme, urlinfo.netloc, '', '', '', ''))
            path = urlinfo.path.rsplit('/', 1)[0]

            FILETYPE = ['\.pdf$', '\.ppt$', '\.pptx$', '\.doc$', '\.docx$', '\.xls$', '\.xlsx$', '\.wmv$', '\.mp4$', '\.mp3$']

            #Loop iterates through list of file types for open URL.
            for types in FILETYPE:
                for link in soup.findAll(href = compile(types)):
                    urlfile = link.get('href')
                    filename = urlfile.split('/')[-1]
                    #Appends or increments a _N suffix (file.pdf -> file_1.pdf -> file_2.pdf) until the name is unique.
                    while os.path.exists(filename):
                        try:
                            fileprefix = filename.split('_')[0]
                            filetype = filename.split('.')[-1]
                            num = int(filename.split('.')[0].split('_')[1])
                            filename = fileprefix + '_' + str(num + 1) + '.' + filetype
                        except:
                            filetype = filename.split('.')[1]
                            fileprefix = filename.split('.')[0] + '_' + str(1)
                            filename = fileprefix + '.' + filetype

                    #Creates a full URL if needed.
                    if '://' not in urlfile and not urlfile.startswith('//'):
                        if not urlfile.startswith('/'):
                            urlfile = urlparse.urljoin(path, urlfile)
                        urlfile = urlparse.urljoin(domain, urlfile)

                    #Downloads the urlfile or returns error for manual inspection
                    try:
                        #Percentage is a progress-reporting callback (reporthook) defined elsewhere.
                        urlretrieve(urlfile, filename, Percentage)
                        writer.writerow(['SUCCESS', url, urlfile, filename])
                        print "     SUCCESS"
                    except:
                        print "     ERROR"
                        writer.writerow(['ERROR', url, urlfile, filename])

Everything works fine except that the data is not being written to the CSV. No directories are being changed (that I know of, at least...).

The script iterates through the external list of URLs, finds the files, downloads them properly, and prints "SUCCESS" or "ERROR" without issue. The only thing it's NOT doing is writing the data to the CSV file. It will run through in its entirety without writing any CSV data.

I tried running it in a virtualenv to make sure there weren't any weird package issues.

Is there something going on with my embedded loops that's causing the CSV data to fail to write?

asdoylejr
  • Where are you actually _calling_ `downloader()`? What is the actual output when running? – Joachim Isaksson Mar 27 '14 at 18:46
  • Are you saying that the final `except` clause is invoked? That a series of `'ERROR'` lines is written into the log? – Smandoli Mar 27 '14 at 18:46
  • It's being called in a `main()` function, which is in turn being called in `if __name__ == '__main__'` – asdoylejr Mar 27 '14 at 18:48
  • What do you mean is the except clause invoked? – asdoylejr Mar 27 '14 at 18:48
  • "Everything works fine except..." Maybe you can be more precise. At what point does it quit "working fine"? – Smandoli Mar 27 '14 at 18:48
  • By `except` clause I mean the final line of code in your code block. Is that bit of code run? – Smandoli Mar 27 '14 at 18:49
  • Sorry, edited my main question to explain what I mean by "everything is fine." – asdoylejr Mar 27 '14 at 18:52
  • The final `except` only runs if the `urlretrieve()` fails. – asdoylejr Mar 27 '14 at 18:53
  • It's not the embedded loops. `writerow` will happen however many iterations it finds itself in. – broinjc Mar 27 '14 at 18:59
  • Try changing your `print " SUCCESS"` to `print " SUCCESS", ['SUCCESS', url, urlfile, filename]` and likewise for the `print " ERROR"` statement to see what data is supposedly being sent to `writer.writerow()`. – martineau Mar 27 '14 at 19:10
  • Shouldn't those backslashes in the `FILETYPE` strings be doubled? – martineau Mar 27 '14 at 19:13
  • Interestingly enough, I commented out the `urlretrieve()` call and the data started writing into the CSV. Why would the `urlretrieve()` cause an issue, and how can it be avoided? – asdoylejr Mar 27 '14 at 19:26
  • Using bare `except:` statements is generally not a good idea because it will catch all exceptions, even ones like `SyntaxError`, `SystemError`, and `EnvironmentError`. Better to be specific, even if it's just `except Exception as e:`, which will avoid catching system-exiting exceptions and allow you to at least `print e` and see what was going on. – martineau Mar 27 '14 at 20:27
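A minimal sketch of the specific-exception pattern martineau suggests above, dropped into the question's URL loop (the names are reused from the question; the `.strip()` is an extra touch, since `readlines()` keeps the trailing newlines):

from urllib import urlopen

URLS = open("urlfile.txt").readlines()

for url in URLS:
    url = url.strip()  #readlines() leaves a trailing newline on each URL
    try:
        html_data = urlopen(url)
    except Exception as e:
        #Report the actual failure instead of hiding it behind a bare except.
        print 'Error opening URL: ' + url + ' (' + str(e) + ')'
        continue
    #...parse html_data with BeautifulSoup as in the question...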

2 Answers


Try `with open('data.csv', 'wb') as csvfile:` instead.

http://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files

Or, build an iterable of rows in place of calling `writerow` each time, and later use `writerows`. If you run your script in interactive mode you can peek at the contents of your iterable of rows (e.g. `[['SUCCESS', ...], ['SUCCESS', ...], ...]`).

import csv
with open('some.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(someiterable)
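A minimal sketch of that approach, using the question's column layout (the two example rows are hypothetical placeholders, only there to keep the sketch runnable):

import csv

rows = []

#Inside the download loop you would collect each result instead of writing it
#immediately, e.g. rows.append(['SUCCESS', url, urlfile, filename]).
rows.append(['SUCCESS', 'http://example.com/page', 'http://example.com/file.pdf', 'file.pdf'])
rows.append(['ERROR', 'http://example.com/other', 'http://example.com/file.doc', 'file.doc'])

#After the loop, write everything in one call. In interactive mode you can
#inspect rows before this point to see exactly what will land in the CSV.
with open('data.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(rows)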
broinjc
  • Thanks, but that does not work. Same issue...script runs without issue but no CSV data is written. – asdoylejr Mar 27 '14 at 19:00
  • Take a look at that edit. It's very odd that your script would print properly without writing... – broinjc Mar 27 '14 at 19:11
  • Commenting out the `urlretrieve()` call caused the data to be written to the CSV. Do you have any idea why that would happen? – asdoylejr Mar 27 '14 at 19:38
  • Hmmm, have you read these? http://docs.python.org/2/library/urllib.html#urllib.urlretrieve and http://stackoverflow.com/questions/987876/how-to-know-if-urllib-urlretrieve-succeeds – broinjc Mar 27 '14 at 19:41
  • Are you importing `urllib`? – broinjc Mar 27 '14 at 19:41
  • Yes. Even tried converting to Python 3 and still causing the same issue. – asdoylejr Mar 27 '14 at 19:43
  • Well, I didn't repaste my conversion to Python3, just made the changes locally for testing purposes. And no I haven't tried urllib2, I guess I can give that a try. – asdoylejr Mar 27 '14 at 19:48

So, I let the script run in its entirety, and for some reason the data started being written to the CSV after it had been running for a while. I'm not sure how to explain that. Had the data somehow been held in memory and then written out at a random time? I don't know, but the data is accurate compared to the log printed in my terminal.

Weird.
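For what it's worth, this looks like ordinary output buffering: `csv.writer` hands each row to a buffered file object, and nothing reaches disk until the buffer fills or the file is closed, which can take a long while when most of the run is spent inside `urlretrieve()`. A minimal sketch (hypothetical row, same column layout as the question) that pushes each row out as soon as it is written:

import csv

with open('data.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile)
    #Hypothetical row using the question's column layout.
    writer.writerow(['SUCCESS', 'http://example.com/page', 'http://example.com/file.pdf', 'file.pdf'])
    #Force the buffered row out to the operating system right away instead of
    #waiting for the buffer to fill or the file to be closed.
    csvfile.flush()

With a `csvfile.flush()` after each `writerow()` call, the rows should appear in data.csv while the script is still running rather than all at once near the end.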

asdoylejr