I'm trying to find a way to download all the docx files from the following URL, using Python or R:

https://www.microsoft.com/en-us/Investor/annual-reports.aspx

I've looked into similar questions (here, here, and here), but none of the code worked for me. (Apologies: the files on the website are .docx files, not PDFs.) Essentially, I'd like to download all the annual reports at once.

Thanks in advance

lovestacksflow

1 Answer

From what I can see, there are no .docx files to download directly from the URL provided. All download links lead to other pages, from which the documents can be downloaded. As a first step, we can list the URLs of the pages on which the actual .docx files are linked:

import requests
from bs4 import BeautifulSoup

cookies = {
    'MS-CV': 'YqY5ipCxbUme1TJF.1',
}

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/112.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'cross-site',
    'Sec-Fetch-User': '?1',
}

response = requests.get('https://www.microsoft.com/en-us/Investor/annual-reports.aspx', cookies=cookies, headers=headers)

scrape_list = []
if response.ok:
    html = response.content
    soup = BeautifulSoup(html, "html.parser")

    # Keep absolute links that point at report pages.
    # Note: the "index" exclusion also drops some report pages (see the comments below).
    for a in soup.find_all('a', href=True):
        if "reports" in a['href'] and "http" in a["href"] and "index" not in a["href"]:
            print(f"[{a['href']}]({a['href']})")
            scrape_list.append(a["href"])

This gives the following pages to investigate further:

https://www.microsoft.com/en-us/corporate-responsibility/reports-hub
https://www.microsoft.com/investor/reports/ar22/download-center/
https://www.microsoft.com/investor/reports/ar21/download-center/
https://www.microsoft.com/investor/reports/ar20/download-center/
https://www.microsoft.com/investor/reports/ar19/download-center/
https://www.microsoft.com/en-us/annualreports/ar2018/annualreport
https://www.microsoft.com/investor/reports/ar14/download-center.html
https://www.microsoft.com/investor/reports/ar11/download_center.html
http://www.microsoft.com/investor/reports/ar10/10k_dl_dow.html
http://www.microsoft.com/investor/reports/ar09/10k_dl_dow.html
http://www.microsoft.com/investor/reports/ar08/10k_dl_dow.html
http://www.microsoft.com/investor/reports/ar07/staticversion/10k_dl_dow.html
http://www.microsoft.com/investor/reports/ar06/staticversion/10k_dl_dow.html
http://www.microsoft.com/investor/reports/ar05/staticversion/10k_dl_dow.html
http://www.microsoft.com/investor/reports/ar04/nonflash/default.html
http://www.microsoft.com/investor/reports/ar04/nonflash/default.html
http://www.microsoft.com/investor/reports/ar04/nonflash/10k_dl_main.html
http://www.microsoft.com/investor/reports/ar03/default.htm
http://www.microsoft.com/investor/reports/ar03/default.htm
http://www.microsoft.com/investor/reports/ar03/downloads.htm
http://www.microsoft.com/investor/reports/ar02/default.htm
http://www.microsoft.com/investor/reports/ar02/default.htm
https://www.microsoft.com/investor/reports/ar02/downloads/default.htm
http://www.microsoft.com/investor/reports/ar00/default.htm
http://www.microsoft.com/investor/reports/ar00/default.htm
http://www.microsoft.com/investor/reports/ar00/download.htm
http://www.microsoft.com/investor/reports/ar99/default.htm
http://www.microsoft.com/investor/reports/ar99/default.htm
http://www.microsoft.com/investor/reports/ar99/download.htm
http://www.microsoft.com/investor/reports/ar98/default.htm
http://www.microsoft.com/investor/reports/ar98/default.htm
http://www.microsoft.com/investor/reports/ar98/download.htm
http://www.microsoft.com/investor/reports/ar97/default.htm
http://www.microsoft.com/investor/reports/ar97/default.htm
http://www.microsoft.com/investor/reports/ar97/default.htm
http://www.microsoft.com/investor/reports/ar96/default.htm
http://www.microsoft.com/investor/reports/ar96/default.htm
http://www.microsoft.com/investor/reports/ar96/default.htm
https://www.microsoft.com/investor/reports/ar22/download-center/
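
The list of pages contains repeated entries (for example, ar22/download-center/ appears twice), so each duplicate would be fetched again in the next step. A minimal sketch for removing duplicates while keeping the original order, assuming scrape_list is a plain list of strings; the helper name unique_in_order is hypothetical and the sample is a shortened excerpt of the list:

```python
def unique_in_order(urls):
    """Return urls without duplicates, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(urls))


# Shortened excerpt of the page list, including two repeated entries.
sample = [
    "https://www.microsoft.com/investor/reports/ar22/download-center/",
    "http://www.microsoft.com/investor/reports/ar03/default.htm",
    "http://www.microsoft.com/investor/reports/ar03/default.htm",
    "https://www.microsoft.com/investor/reports/ar22/download-center/",
]
deduped = unique_in_order(sample)
print(deduped)
```

In the answer's code this could be applied as scrape_list = unique_in_order(scrape_list) before the next loop.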

The next step is to retrieve all .docx URLs from these pages:

from random import randint
from time import sleep
from tqdm import tqdm

docx_files = []
for url in tqdm(scrape_list):
    # Pause a few seconds between requests to avoid hammering the server.
    sleep(randint(2, 5))
    response = requests.get(url, cookies=cookies, headers=headers)

    if response.ok:
        soup = BeautifulSoup(response.content, "html.parser")
        # Keep absolute links whose href mentions "docx".
        for a in soup.find_all('a', href=True):
            if "docx" in a['href'] and "http" in a['href']:
                docx_files.append(a['href'])
    else:
        # Report pages that could not be fetched (e.g. 404).
        print(response, "for", url)

 15%|█▌        | 6/39 [00:01<00:11,  2.96it/s]
<Response [404]> for https://www.microsoft.com/en-us/annualreports/ar2018/annualreport
 41%|████      | 16/39 [00:07<00:14,  1.59it/s]
<Response [404]> for http://www.microsoft.com/investor/reports/ar04/nonflash/default.html 
 46%|████▌     | 18/39 [00:08<00:13,  1.52it/s]
<Response [404]> for http://www.microsoft.com/investor/reports/ar03/default.htm 
 49%|████▊     | 19/39 [00:09<00:13,  1.49it/s]
<Response [404]> for http://www.microsoft.com/investor/reports/ar03/default.htm 
 54%|█████▍    | 21/39 [00:10<00:11,  1.50it/s]
<Response [404]> for http://www.microsoft.com/investor/reports/ar02/default.htm 
 59%|█████▉    | 23/39 [00:11<00:08,  1.95it/s]
<Response [404]> for http://www.microsoft.com/investor/reports/ar02/default.htm 
 62%|██████▏   | 24/39 [00:12<00:08,  1.77it/s]
<Response [404]> for http://www.microsoft.com/investor/reports/ar00/default.htm 
 64%|██████▍   | 25/39 [00:12<00:08,  1.64it/s]
<Response [404]> for http://www.microsoft.com/investor/reports/ar00/default.htm 
 69%|██████▉   | 27/39 [00:14<00:07,  1.56it/s]
<Response [404]> for http://www.microsoft.com/investor/reports/ar99/default.htm 
 72%|███████▏  | 28/39 [00:14<00:07,  1.53it/s]
<Response [404]> for http://www.microsoft.com/investor/reports/ar99/default.htm 
 77%|███████▋  | 30/39 [00:16<00:06,  1.48it/s]
<Response [404]> for http://www.microsoft.com/investor/reports/ar98/default.htm 
 79%|███████▉  | 31/39 [00:17<00:05,  1.47it/s]
<Response [404]> for http://www.microsoft.com/investor/reports/ar98/default.htm 
 85%|████████▍ | 33/39 [00:18<00:04,  1.45it/s]
<Response [404]> for http://www.microsoft.com/investor/reports/ar97/default.htm 
 87%|████████▋ | 34/39 [00:19<00:03,  1.45it/s]
<Response [404]> for http://www.microsoft.com/investor/reports/ar97/default.htm 
 92%|█████████▏| 36/39 [00:20<00:02,  1.45it/s]
<Response [404]> for http://www.microsoft.com/investor/reports/ar96/default.htm 
 95%|█████████▍| 37/39 [00:21<00:01,  1.45it/s]
<Response [404]> for http://www.microsoft.com/investor/reports/ar96/default.htm 
100%|██████████| 39/39 [00:22<00:00,  1.77it/s]

docx_files:

['https://view.officeapps.live.com/op/view.aspx?src=https://c.s-microsoft.com/en-us/CMSFiles/2022_Annual_Report.docx?version=71ba1987-3b76-28bc-5a24-b81f10f3a7a0',
 'https://c.s-microsoft.com/en-us/CMSFiles/2022_Annual_Report.docx?version=71ba1987-3b76-28bc-5a24-b81f10f3a7a0',
 'https://view.officeapps.live.com/op/view.aspx?src=https://c.s-microsoft.com/en-us/CMSFiles/MSFT_FY22Q4_10K.docx?version=401b18a2-7dfa-3105-fd3b-90c1a5ca04bb',
 'https://c.s-microsoft.com/en-us/CMSFiles/MSFT_FY22Q4_10K.docx?version=401b18a2-7dfa-3105-fd3b-90c1a5ca04bb',
 'https://view.officeapps.live.com/op/view.aspx?src=https://c.s-microsoft.com/en-us/CMSFiles/2022_Shareholder_Letter.docx?version=cb4b090f-97d2-ef69-0b8d-8d5b061d45aa',
 'https://c.s-microsoft.com/en-us/CMSFiles/2022_Shareholder_Letter.docx?version=cb4b090f-97d2-ef69-0b8d-8d5b061d45aa',
 'http://view.officeapps.live.com/op/view.aspx?src=https://c.s-microsoft.com/en-us/CMSFiles/2021_Annual_Report.docx?version=5290c17d-8858-c9ef-d16f-60e02f42214e',
 'https://c.s-microsoft.com/en-us/CMSFiles/2021_Annual_Report.docx?version=5290c17d-8858-c9ef-d16f-60e02f42214e',
 'https://view.officeapps.live.com/op/view.aspx?src=https://c.s-microsoft.com/en-us/CMSFiles/MSFT_FY21Q4_10K.docx?version=01062c71-0508-22e2-9fb8-9ecf51bb6378',
 'https://c.s-microsoft.com/en-us/CMSFiles/MSFT_FY21Q4_10K.docx?version=01062c71-0508-22e2-9fb8-9ecf51bb6378',
 'http://view.officeapps.live.com/op/view.aspx?src=https://c.s-microsoft.com/en-us/CMSFiles/2021_Shareholder_Letter.docx?version=2f3bdd2d-4568-e98a-645d-713ac2461978',
 'https://c.s-microsoft.com/en-us/CMSFiles/2021_Shareholder_Letter.docx?version=2f3bdd2d-4568-e98a-645d-713ac2461978',
 'https://view.officeapps.live.com/op/view.aspx?src=https://c.s-microsoft.com/en-us/CMSFiles/2020_Annual_Report.docx?version=8a3ca1db-2de7-c0e7-d7c5-176c412a395e',
 'https://c.s-microsoft.com/en-us/CMSFiles/2020_Annual_Report.docx?version=8a3ca1db-2de7-c0e7-d7c5-176c412a395e',
 'http://view.officeapps.live.com/op/view.aspx?src=https://c.s-microsoft.com/en-us/CMSFiles/MSFT_FY20Q4_10K.docx?version=71873a68-d431-e887-124f-4d24b9ade60c',
 'https://c.s-microsoft.com/en-us/CMSFiles/MSFT_FY20Q4_10K.docx?version=71873a68-d431-e887-124f-4d24b9ade60c',
 'https://view.officeapps.live.com/op/view.aspx?src=https://c.s-microsoft.com/en-us/CMSFiles/2020_Shareholder_Letter.docx?version=e783c2cc-537c-fdb7-5a0c-73a848250f05',
 'https://c.s-microsoft.com/en-us/CMSFiles/2020_Shareholder_Letter.docx?version=e783c2cc-537c-fdb7-5a0c-73a848250f05',
 'https://view.officeapps.live.com/op/view.aspx?src=https://c.s-microsoft.com/en-us/CMSFiles/MSFT_FY19Q4_10K.docx?version=0a785912-1d8b-1ee0-f8d8-63f2fb7a5f00',
 'https://c.s-microsoft.com/en-us/CMSFiles/MSFT_FY19Q4_10K.docx?version=0a785912-1d8b-1ee0-f8d8-63f2fb7a5f00',
 'https://view.officeapps.live.com/op/view.aspx?src=https://c.s-microsoft.com/en-us/CMSFiles/2019_Shareholder_Letter.docx?version=56169a49-efd1-27be-1777-6c36b3426da1',
 'https://c.s-microsoft.com/en-us/CMSFiles/2019_Shareholder_Letter.docx?version=56169a49-efd1-27be-1777-6c36b3426da1',
 'http://view.officeapps.live.com/op/view.aspx?src=https://c.s-microsoft.com/en-us/CMSFiles/2019_Proxy_Statement.docx?version=eae4affc-fe10-b796-c108-566255610e0f',
 'http://view.officeapps.live.com/op/view.aspx?src=http://www.microsoft.com/investor/reports/ar14/docs/2014_Annual_Report.docx',
 'http://view.officeapps.live.com/op/view.aspx?src=http://www.microsoft.com/investor/reports/ar14/docs/MSFT_FY14Q4_10K.docx',
 'http://view.officeapps.live.com/op/view.aspx?src=http://www.microsoft.com/investor/reports/ar14/docs/2014_Shareholder_Letter.docx',
 'http://www.microsoft.com/investor/Downloads/Investor Services/Information for Investors/2011_Proxy_Statement.docx',
 'http://www.microsoft.com/investor/Downloads/Investor Services/Information for Investors/2010_Proxy_Statement.docx',
 'http://cid-4910e8dd2e872bb2.office.live.com/view.aspx/FY2010/Microsoft%202010%2010K.docx',
 'http://cid-4910e8dd2e872bb2.office.live.com/view.aspx/FY2010/Microsoft%202010%20Annual%20Report.docx',
 'https://cid-4910e8dd2e872bb2.office.live.com/view.aspx/FY2010/2010%20Letter%20to%20Shareholders.docx',
 'https://view.officeapps.live.com/op/view.aspx?src=https://c.s-microsoft.com/en-us/CMSFiles/2022_Annual_Report.docx?version=71ba1987-3b76-28bc-5a24-b81f10f3a7a0',
 'https://c.s-microsoft.com/en-us/CMSFiles/2022_Annual_Report.docx?version=71ba1987-3b76-28bc-5a24-b81f10f3a7a0',
 'https://view.officeapps.live.com/op/view.aspx?src=https://c.s-microsoft.com/en-us/CMSFiles/MSFT_FY22Q4_10K.docx?version=401b18a2-7dfa-3105-fd3b-90c1a5ca04bb',
 'https://c.s-microsoft.com/en-us/CMSFiles/MSFT_FY22Q4_10K.docx?version=401b18a2-7dfa-3105-fd3b-90c1a5ca04bb',
 'https://view.officeapps.live.com/op/view.aspx?src=https://c.s-microsoft.com/en-us/CMSFiles/2022_Shareholder_Letter.docx?version=cb4b090f-97d2-ef69-0b8d-8d5b061d45aa',
 'https://c.s-microsoft.com/en-us/CMSFiles/2022_Shareholder_Letter.docx?version=cb4b090f-97d2-ef69-0b8d-8d5b061d45aa']
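
Many of the entries in docx_files are Office viewer links that wrap the real file URL in a src parameter, and several files appear more than once. A sketch of one way to normalise the list before downloading, assuming the viewer links keep the view.aspx?src= shape shown in the list. The helper name direct_docx_urls is hypothetical, and comparing URLs by host and path (ignoring the scheme and ?version=) is a design choice made here, not something the site documents:

```python
from urllib.parse import parse_qs, urlparse


def direct_docx_urls(urls):
    """Unwrap viewer links and drop duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for url in urls:
        parsed = urlparse(url)
        # Viewer links carry the real file location in the "src" parameter.
        src = parse_qs(parsed.query).get("src")
        if src:
            url = src[0]
            parsed = urlparse(url)
        # Skip anything that is still not a direct .docx path,
        # e.g. the office.live.com/view.aspx/... links.
        if "view.aspx" in parsed.path or not parsed.path.lower().endswith(".docx"):
            continue
        key = (parsed.netloc.lower(), parsed.path)
        if key not in seen:
            seen.add(key)
            result.append(url)
    return result


# Four entries copied from the docx_files list.
sample = [
    "https://view.officeapps.live.com/op/view.aspx?src=https://c.s-microsoft.com/en-us/CMSFiles/2022_Annual_Report.docx?version=71ba1987-3b76-28bc-5a24-b81f10f3a7a0",
    "https://c.s-microsoft.com/en-us/CMSFiles/2022_Annual_Report.docx?version=71ba1987-3b76-28bc-5a24-b81f10f3a7a0",
    "http://view.officeapps.live.com/op/view.aspx?src=http://www.microsoft.com/investor/reports/ar14/docs/2014_Annual_Report.docx",
    "http://cid-4910e8dd2e872bb2.office.live.com/view.aspx/FY2010/Microsoft%202010%2010K.docx",
]
cleaned = direct_docx_urls(sample)
print(cleaned)
```

With this, the 2014 files that only appear behind a viewer link become direct URLs, while the FY2010 office.live.com links are left out because they expose no src parameter to unwrap.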

The last step is to download these .docx files:

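def download_file(url):
    # Take the last path segment and drop any query string such as
    # "?version=...", so it does not end up in the filename.
    local_filename = url.split('/')[-1].split('?')[0]
    # To save elsewhere, join a directory onto the name, e.g.
    # os.path.join("/dir/of/interest", local_filename).
    # Stream the response to disk in chunks instead of holding it in memory.
    with requests.get(url, stream=True, timeout=30) as r:
        r.raise_for_status()
        with open(local_filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    return local_filename

# Note: view.aspx links return the online viewer page, not the document itself.
for i in docx_files:
    download_file(i)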
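
The link filters in this answer only keep hrefs that already contain "http", so a page that links its documents with relative paths would be skipped. Whether any of these pages do so is an assumption, not something verified here. A sketch that resolves hrefs against the page URL before filtering; absolute_docx_links is a hypothetical helper, and the first two sample hrefs are made up for illustration:

```python
from urllib.parse import urljoin, urlparse


def absolute_docx_links(page_url, hrefs):
    """Resolve hrefs against page_url and keep those whose path ends in .docx."""
    links = []
    for href in hrefs:
        # urljoin leaves absolute URLs unchanged and resolves relative ones.
        absolute = urljoin(page_url, href.strip())
        if urlparse(absolute).path.lower().endswith(".docx"):
            links.append(absolute)
    return links


page = "https://www.microsoft.com/investor/reports/ar14/download-center.html"
hrefs = [
    "docs/2014_Annual_Report.docx",       # relative link (illustrative)
    "/investor/reports/ar14/index.html",  # not a document (illustrative)
    "https://c.s-microsoft.com/en-us/CMSFiles/2022_Annual_Report.docx?version=71ba1987-3b76-28bc-5a24-b81f10f3a7a0",
]
found = absolute_docx_links(page, hrefs)
print(found)
```

Inside the scraping loop, the hrefs could come from [a['href'] for a in soup.find_all('a', href=True)]. Because this filter checks the URL path, viewer links of the view.aspx?src= form would not match and would need separate handling.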
Rivered
  • Thanks. So the next step is to feed these URLs into a scraping code? – lovestacksflow Apr 24 '23 at 12:59
  • Yes we repeat the same process, except this time you collect all hrefs containing docx, and then you start feeding those docx URLs to your code to download them. – Rivered Apr 24 '23 at 16:21
  • You might want to check if there are also other file formats outside of .docx which you want to download. – Rivered Apr 24 '23 at 16:34
  • On a follow up note, it seems to miss certain files, such as the one for 2014 with this [link](https://www.microsoft.com/investor/reports/ar14/docs/2014_Annual_Report.docx) Any possible reason for that? – lovestacksflow Apr 29 '23 at 17:14
  • Sorry, the actual link should be for the 2015 report [here](https://www.microsoft.com/investor/reports/ar15/docs/2015_Annual_Report.docx) – lovestacksflow Apr 29 '23 at 21:25
  • Never mind, I figured out it's because "index" is excluded in the initial condition, so I had to remove that check. – lovestacksflow Apr 29 '23 at 21:32
  • Glad to see it worked out :) – Rivered Apr 30 '23 at 19:15