
As a newbie in Python (2.7), I'm looking for a suggestion:

I have a CSV file with HTTP links stored in one column, comma-delimited.

http://example.com/file.pdf,
http://example.com/file.xls,
http://example.com/file.xlsx,
http://example.com/file.doc,

The main aim is to loop through all these links and download the files with their original names and extensions.

My search results and help here gave me the following script:

import urllib2
import pandas as pd 

links = pd.read_csv('links.csv', sep=',', header =(0))

url = links                   # I know this part is wrong but don't know how to do it right

user_agent = 'Mozilla 5.0 (Windows 7; Win64; x64)'

file_name = "tessst"          # a placeholder file name - but how do I get their original names?

u = urllib2.Request(url, headers = {'User-Agent' : user_agent})
req = urllib2.urlopen(u)
f = open(file_name, 'wb')
f.write(req.read())

f.close()

Please, any help is appreciated.

P.S. I'm not sure about pandas - maybe the csv module is better?
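Regarding the P.S.: since the file is just one column of URLs, the stdlib csv module is enough here; pandas is not needed. A minimal sketch (the helper name `read_links` is my own, not from the thread):

```python
import csv

def read_links(path):
    """Collect the URLs from the first column of a one-column CSV file."""
    links = []
    with open(path) as f:          # use mode 'rb' on Python 2
        for row in csv.reader(f):
            # skip blank rows and anything that is not a link
            if row and row[0].startswith('http'):
                links.append(row[0])
    return links
```

Each URL then ends up as `row[0]`, because the trailing comma in each line is the delimiter, not part of the value.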

David Graig

1 Answer


If I can assume your CSV file contains only one column, holding the links, then this will work:

import csv, sys
import urllib2
import os

filename = 'test.csv'
with open(filename, 'rb') as f:
    reader = csv.reader(f)
    try:
        for row in reader:
            if 'http' in row[0]:
                # reverse the URL, take everything up to the first '/',
                # then reverse again: that is the name after the last '/'
                rev = row[0][::-1]
                i = rev.index('/')
                name = rev[0:i][::-1]
                rq = urllib2.Request(row[0])
                res = urllib2.urlopen(rq)
                # only download if we have not saved this file before
                if not os.path.exists("./" + name):
                    pdf = open("./" + name, 'wb')
                    pdf.write(res.read())
                    pdf.close()
                else:
                    print "file: ", name, "already exists"
    except csv.Error as e:
        sys.exit('file %s, line %d: %s' % (filename, reader.line_num, e))
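The string-reversal trick above works, but the same "name after the last slash" can be read off more directly with `os.path.basename` on the URL's path. A sketch (the helper name `filename_from_url` is my own):

```python
import os
try:
    from urllib.parse import urlparse   # Python 3
except ImportError:
    from urlparse import urlparse       # Python 2

def filename_from_url(url):
    """Return the original file name (with extension) from a URL."""
    # urlparse strips any query string; basename keeps what follows the last '/'
    return os.path.basename(urlparse(url).path)
```

This keeps the original extension automatically, since it is part of the last path segment.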
  • In general it works after some changes (adding headers), but it rewrites the file – David Graig Mar 24 '16 at 08:35
  • Glad it was of use. I have changed the code; now it will download only if the file has not previously been downloaded. – Sayed Zainul Abideen Mar 24 '16 at 09:02
  • Thank you for your answers - but the main aim - to have all the files, not just one - is still unreached – David Graig Mar 24 '16 at 09:41
  • Please correct me if I got the question wrong. You have a CSV file which has a column of URLs. You want to extract all the URLs from the CSV, loop through them, and download every (PDF|DOC|DOCX|*) file from there. So I created a test.csv with URLs and downloaded all the assets for every URL. – Sayed Zainul Abideen Mar 24 '16 at 10:24
  • Yes, you got me right, but the script still rewrites the final file. It downloads all the links but always overwrites the last downloaded file with the new one – David Graig Mar 24 '16 at 11:14
  • Hold on! My fault - I did not change the file name in the original script ) – David Graig Mar 24 '16 at 11:21
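The overwrite issue discussed in the comments comes down to saving every download under the same fixed name; saving under each URL's own basename avoids it. A sketch pulling the pieces together - `download_all` and the `fetch` hook are my own names, not from the thread:

```python
import os
try:
    from urllib.request import urlopen   # Python 3
except ImportError:
    from urllib2 import urlopen          # Python 2

def download_all(links, dest_dir='.', fetch=urlopen):
    """Save each link under its own basename so files are not overwritten."""
    saved = []
    for url in links:
        name = os.path.basename(url)     # original name and extension
        path = os.path.join(dest_dir, name)
        if os.path.exists(path):
            continue                     # skip files already downloaded
        data = fetch(url).read()
        with open(path, 'wb') as f:
            f.write(data)
        saved.append(name)
    return saved
```

The `fetch` parameter just makes the downloader easy to swap out (e.g. for an opener that sets a User-Agent header, as the first comment suggests was needed).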