I'm trying to download multiple Word documents from a website into a folder that I can iterate through. They are hosted in a SharePoint list, and I've already parsed the HTML to compile a list of links to these documents. Clicking one of these links prompts you to open or save a Word document. Each link ends with the document's title, so I've split the URL strings to get a list of document names that lines up with my list of URLs. My goal is to write a loop that goes through all the URLs and downloads every document into a folder.

EDIT: taking into consideration @DeepSpace's and @aneroid's suggestions (and trying my best to implement them), my code is:

 import requests
 from requests_ntlm import HttpNtlmAuth
 import shutil

 def download_word_docs(doc_url, doc_name):
     # 'domain\\user' and 'password' are placeholders for the real credentials
     r = requests.get(doc_url, auth=HttpNtlmAuth('domain\\user', 'password'), stream=True)
     with open(doc_name, 'wb') as f:
         shutil.copyfileobj(r.raw, f)  # where is it copying the file object to?

I think this is different from downloading an image because my request goes to a download link rather than to an actual JPEG file. I may be wrong, but this is a tricky situation.

I'm still trying to get my program to download (or create a copy of) a .docx into a folder whose path I can set. Currently it runs in the admin command prompt (I'm on Windows) without error, but I don't know where it's copying the file to. My hope is that once I get one download to work, I can figure out how to loop it over my list of URLs. Thanks (@DeepSpace and @aneroid) for your help so far.

Vince
  • By default Python resolves relative filenames against the current working directory (usually the folder you launched the program from), so I'd expect the files to be put there. But is it possible that `download_word_docs` isn't being called at all? If you add a `print("download_word_docs was called")` statement to it, does it actually get printed? – Tadhg McDonald-Jensen Jun 27 '16 at 16:55

2 Answers

In your code, you've mentioned

"Any way to avoid opening/writing a new file and download it directly?"

There is no such thing as downloading "directly". Browsers do the same thing with code similar to what you're trying to write: they create a new file with the name specified by the server or the URL.

I wrote this a few days ago for something else; it's similar to the answer linked by @DeepSpace:

import requests

def save_link(book_link, book_name):
    # stream=True avoids loading the whole file into memory at once
    the_book = requests.get(book_link, stream=True)
    with open(book_name, 'wb') as f:
        for chunk in the_book.iter_content(1024 * 1024 * 2):  # 2 MB chunks
            f.write(chunk)
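
On the question of where the file ends up: a relative filename passed to `open` is resolved against the current working directory, so joining the name onto an explicit folder removes the guesswork. Here is a minimal sketch of that, looping over the two lists from the question. The names `save_all` and `folder`, and the idea of passing in a pre-configured `requests.Session`, are my assumptions and not part of the original code:

```python
import os


def save_all(session, urls, names, folder, chunk_size=1024 * 1024):
    """Download each URL into `folder` and return the paths written.

    `session` is assumed to offer a requests-style `get(url, stream=True)`,
    e.g. a `requests.Session` whose `auth` attribute is already set.
    """
    os.makedirs(folder, exist_ok=True)  # create the target folder if needed
    saved = []
    for url, name in zip(urls, names):
        destination = os.path.join(folder, name)
        response = session.get(url, stream=True)
        # Fail loudly on 401/404 rather than saving an error page as a .docx
        response.raise_for_status()
        with open(destination, 'wb') as f:
            for chunk in response.iter_content(chunk_size):
                f.write(chunk)
        saved.append(os.path.abspath(destination))
    return saved
```

With `session = requests.Session()` and `session.auth = HttpNtlmAuth('domain\\user', 'password')`, a call like `save_all(session, doc_urls, doc_names, 'C:/word_docs')` would return the absolute paths of the files it wrote (`doc_urls`, `doc_names`, and the folder are placeholder names).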

`book_name` was retrieved from the link's text in another function, but you can also do this:

  1. Check if the response headers include a filename.

  2. If not, use the end of the URL as the filename, if possible:

    >>> from urllib.parse import unquote_plus  # Python 3; in Python 2 use urllib.unquote_plus
    >>> the_link = 'http://example.com/some_path/Special%20document.doc'
    >>> filename = unquote_plus(the_link.split('/')[-1])
    >>> print(filename)
    Special document.doc
    >>> # then do
    ... with open(filename, 'wb') as f:
    ...     # etc.
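
Step 1 can be sketched with the standard library alone. This assumes the server sends a `Content-Disposition` header of the usual `attachment; filename="..."` form. `filename_from_response` is a hypothetical helper name, and `headers` can be `response.headers` from requests or any dict-like object. For the fallback it applies `unquote` to the URL path only, so a query string is not mistaken for part of the name:

```python
import os
from email.message import Message
from urllib.parse import unquote, urlsplit


def filename_from_response(headers, url):
    """Pick a filename from Content-Disposition, else from the end of the URL."""
    disposition = headers.get('Content-Disposition')
    if disposition:
        # Let the stdlib header parser extract the filename parameter
        msg = Message()
        msg['Content-Disposition'] = disposition
        name = msg.get_filename()
        if name:
            # Drop any directory part the server may have included
            return os.path.basename(name)
    # Fallback: last segment of the URL path, percent-decoded
    return unquote(urlsplit(url).path.rsplit('/', 1)[-1])
```

The result can then be passed to `open(filename, 'wb')` as in the snippets above.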
    
aneroid

Try this code and see if it works for you:

from urllib.request import Request, urlopen

def get_html(url, timeout = 15):
    ''' function returns the open response for url ('' if both attempts fail)
    usually urlopen(url) is enough, but some servers reject the default
    user agent, so the request is retried with a browser-like User-Agent.
    urllib.request is part of the Python 3 standard library; any other
    way of fetching the url would work here too'''

    html = ''
    try:
        html = urlopen(url, None, timeout)
    except Exception:
        # retry with a browser-like User-Agent header
        url = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        try:
            html = urlopen(url, None, timeout)
        except Exception:
            pass
    return html

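def get_current_path():
    ''' function returns path of folder in which python program is saved
    (with a trailing '/'), or '' if it cannot be determined'''

    import os
    import sys
    try:
        path = __file__
    except NameError:  # e.g. in an interactive session
        path = sys.argv[0] if sys.argv else ''
    if path:
        # abspath also handles a bare 'script.py' that contains no separator
        path = os.path.dirname(os.path.abspath(path)).replace('\\', '/')
        if not path.endswith('/'):
            path += '/'
    return path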

def check_if_name_already_exists(name, path, extension):
    ''' function checks if there is already existing file
    with same name in folder given by path.'''

    try:
        file = open(path + name + extension, 'r')
        file.close()
        return True
    except:
        return False

def get_new_name(old_name, path, extension):
    ''' function asks user whether to replace the existing file or enter
    a new name, and returns the name to use.'''

    print('File with name "{}" already exists.'.format(old_name))
    answer = input('Would you like to replace it (answer with "r")\nor create new one (answer with "n") ? ')
    # compare against a tuple: a substring test would also accept '' or 'rR'
    while answer not in ('r', 'R', 'n', 'N'):
        print('Your answer is inconclusive')
        print('Please answer again:')
        print('if you would like to replace the existing file answer with "r"')
        print('if you would like to create new one answer with "n"')
        answer = input('Would you like to replace it (answer with "r")\n or create new one (answer with "n") ? ')
    if answer in ('n', 'N'):
        new_name = input('Enter new name for file: ')
        if check_if_name_already_exists(new_name, path, extension):
            return get_new_name(new_name, path, extension)
        else:
            return new_name
    return old_name

def get_url_extension(url):
    ''' function returns '.docx' or '.doc' if url ends with it, else '' '''
    if url.endswith('.docx'):
        return '.docx'
    if url.endswith('.doc'):
        return '.doc'
    return ''

def download_word(url, name = 'document', path = None):
    '''function downloads word file from its url
    required argument is url of word file and
    optional argument is name for saved word file and
    optional argument path if you want to choose where your file is saved
    variable path must look like:
        'C:\\Users\\Computer name\\Desktop' or
        'C:/Users/Computer name/Desktop' '''
    # and not like
    #   'C:\Users\Computer name\Desktop'

    word = get_html(url)
    extension = get_url_extension(url)

    name = name.replace(extension, '')
    if path is None:
        path = get_current_path()
    if '\\' in path:
        path = path.replace('\\', '/')
    # guard against an empty path before indexing its last character
    if path and path[-1] != '/':
        path += '/'
    if path:
        check = check_if_name_already_exists(name, path, extension)
        if check:
            if name == 'document':
                i = 1
                name = 'document(' + str(i) + ')'
                while check_if_name_already_exists(name, path, extension):
                    i += 1
                    name = 'document(' + str(i) + ')'
            else:
                name = get_new_name(name, path, extension)
        file = open(path+name + extension, 'wb')
    else:
        file = open(name + extension, 'wb')

    file.write(word.read())
    file.close()
    if path:
        print(name + extension + ' file downloaded in folder "{}".'.format(path))
    else:
        print(name + extension + ' file downloaded.')
    return


# two sample URLs; uncomment the first to try a .doc instead of a .docx
# download_url = 'http://www.scripps.edu/library/open/instruction/googletips.doc'
download_url = 'http://regionblekinge.se/a/uploads/dokument/demo.docx'
download_word(download_url)
ands