From what I can see, the URL you provided does not link to any .docx files directly. Every download link leads to another page, and the documents are downloaded from there. As a first step, we can list the URLs of the pages on which the actual .docx files are linked:
import requests
from bs4 import BeautifulSoup
cookies = {
    'MS-CV': 'YqY5ipCxbUme1TJF.1',
}
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/112.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'cross-site',
    'Sec-Fetch-User': '?1',
}
response = requests.get('https://www.microsoft.com/en-us/Investor/annual-reports.aspx', cookies=cookies, headers=headers)
scrape_list = []
if response.ok:
    html = response.content
    soup = BeautifulSoup(html, "html.parser")
    # Keep absolute links that point to a report page and skip index pages
    for a in soup.find_all('a', href=True):
        if "reports" in a['href'] and "http" in a["href"] and "index" not in a["href"]:
            print(f"[{a['href']}]({a['href']})")
            scrape_list.append(a["href"])
This gives the following pages to investigate further:
https://www.microsoft.com/en-us/corporate-responsibility/reports-hub
https://www.microsoft.com/investor/reports/ar22/download-center/
https://www.microsoft.com/investor/reports/ar21/download-center/
https://www.microsoft.com/investor/reports/ar20/download-center/
https://www.microsoft.com/investor/reports/ar19/download-center/
https://www.microsoft.com/en-us/annualreports/ar2018/annualreport
https://www.microsoft.com/investor/reports/ar14/download-center.html
https://www.microsoft.com/investor/reports/ar11/download_center.html
http://www.microsoft.com/investor/reports/ar10/10k_dl_dow.html
http://www.microsoft.com/investor/reports/ar09/10k_dl_dow.html
http://www.microsoft.com/investor/reports/ar08/10k_dl_dow.html
http://www.microsoft.com/investor/reports/ar07/staticversion/10k_dl_dow.html
http://www.microsoft.com/investor/reports/ar06/staticversion/10k_dl_dow.html
http://www.microsoft.com/investor/reports/ar05/staticversion/10k_dl_dow.html
http://www.microsoft.com/investor/reports/ar04/nonflash/default.html
http://www.microsoft.com/investor/reports/ar04/nonflash/default.html
http://www.microsoft.com/investor/reports/ar04/nonflash/10k_dl_main.html
http://www.microsoft.com/investor/reports/ar03/default.htm
http://www.microsoft.com/investor/reports/ar03/default.htm
http://www.microsoft.com/investor/reports/ar03/downloads.htm
http://www.microsoft.com/investor/reports/ar02/default.htm
http://www.microsoft.com/investor/reports/ar02/default.htm
https://www.microsoft.com/investor/reports/ar02/downloads/default.htm
http://www.microsoft.com/investor/reports/ar00/default.htm
http://www.microsoft.com/investor/reports/ar00/default.htm
http://www.microsoft.com/investor/reports/ar00/download.htm
http://www.microsoft.com/investor/reports/ar99/default.htm
http://www.microsoft.com/investor/reports/ar99/default.htm
http://www.microsoft.com/investor/reports/ar99/download.htm
http://www.microsoft.com/investor/reports/ar98/default.htm
http://www.microsoft.com/investor/reports/ar98/default.htm
http://www.microsoft.com/investor/reports/ar98/download.htm
http://www.microsoft.com/investor/reports/ar97/default.htm
http://www.microsoft.com/investor/reports/ar97/default.htm
http://www.microsoft.com/investor/reports/ar97/default.htm
http://www.microsoft.com/investor/reports/ar96/default.htm
http://www.microsoft.com/investor/reports/ar96/default.htm
http://www.microsoft.com/investor/reports/ar96/default.htm
https://www.microsoft.com/investor/reports/ar22/download-center/
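Several pages appear more than once in this list, so each of them would be requested repeatedly in the next step. A minimal sketch for dropping the repeats while keeping the order is shown here; the helper name `dedupe_keep_order` is my own and not part of the code above:

```python
def dedupe_keep_order(urls):
    """Return urls without repeats, keeping the first occurrence of each."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(urls))


# Small sample taken from the output above; in practice pass scrape_list instead.
sample = [
    "http://www.microsoft.com/investor/reports/ar04/nonflash/default.html",
    "http://www.microsoft.com/investor/reports/ar04/nonflash/default.html",
    "http://www.microsoft.com/investor/reports/ar04/nonflash/10k_dl_main.html",
]
print(dedupe_keep_order(sample))
```

If you want this, something like `scrape_list = dedupe_keep_order(scrape_list)` before the next loop should be enough.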
The next step is to retrieve all .docx URLs from these pages:
from random import randint
from time import sleep
from tqdm import tqdm
docx_files = []
for i in tqdm(scrape_list):
    sleep(randint(2, 5))  # random pause between requests
    response = requests.get(i, cookies=cookies, headers=headers)
    if response.ok:
        html = response.content
        soup = BeautifulSoup(html, "html.parser")
        # Collect every absolute link that mentions "docx"
        for a in soup.find_all('a', href=True):
            if "docx" in a['href'] and "http" in a['href']:
                docx_files.append(a['href'])
    else:
        print(response, "for", i)
15%|█▌ | 6/39 [00:01<00:11, 2.96it/s]
<Response [404]> for https://www.microsoft.com/en-us/annualreports/ar2018/annualreport
41%|████ | 16/39 [00:07<00:14, 1.59it/s]
<Response [404]> for http://www.microsoft.com/investor/reports/ar04/nonflash/default.html
46%|████▌ | 18/39 [00:08<00:13, 1.52it/s]
<Response [404]> for http://www.microsoft.com/investor/reports/ar03/default.htm
49%|████▊ | 19/39 [00:09<00:13, 1.49it/s]
<Response [404]> for http://www.microsoft.com/investor/reports/ar03/default.htm
54%|█████▍ | 21/39 [00:10<00:11, 1.50it/s]
<Response [404]> for http://www.microsoft.com/investor/reports/ar02/default.htm
59%|█████▉ | 23/39 [00:11<00:08, 1.95it/s]
<Response [404]> for http://www.microsoft.com/investor/reports/ar02/default.htm
62%|██████▏ | 24/39 [00:12<00:08, 1.77it/s]
<Response [404]> for http://www.microsoft.com/investor/reports/ar00/default.htm
64%|██████▍ | 25/39 [00:12<00:08, 1.64it/s]
<Response [404]> for http://www.microsoft.com/investor/reports/ar00/default.htm
69%|██████▉ | 27/39 [00:14<00:07, 1.56it/s]
<Response [404]> for http://www.microsoft.com/investor/reports/ar99/default.htm
72%|███████▏ | 28/39 [00:14<00:07, 1.53it/s]
<Response [404]> for http://www.microsoft.com/investor/reports/ar99/default.htm
77%|███████▋ | 30/39 [00:16<00:06, 1.48it/s]
<Response [404]> for http://www.microsoft.com/investor/reports/ar98/default.htm
79%|███████▉ | 31/39 [00:17<00:05, 1.47it/s]
<Response [404]> for http://www.microsoft.com/investor/reports/ar98/default.htm
85%|████████▍ | 33/39 [00:18<00:04, 1.45it/s]
<Response [404]> for http://www.microsoft.com/investor/reports/ar97/default.htm
87%|████████▋ | 34/39 [00:19<00:03, 1.45it/s]
<Response [404]> for http://www.microsoft.com/investor/reports/ar97/default.htm
92%|█████████▏| 36/39 [00:20<00:02, 1.45it/s]
<Response [404]> for http://www.microsoft.com/investor/reports/ar96/default.htm
95%|█████████▍| 37/39 [00:21<00:01, 1.45it/s]
<Response [404]> for http://www.microsoft.com/investor/reports/ar96/default.htm
100%|██████████| 39/39 [00:22<00:00, 1.77it/s]
docx_files:
['https://view.officeapps.live.com/op/view.aspx?src=https://c.s-microsoft.com/en-us/CMSFiles/2022_Annual_Report.docx?version=71ba1987-3b76-28bc-5a24-b81f10f3a7a0',
'https://c.s-microsoft.com/en-us/CMSFiles/2022_Annual_Report.docx?version=71ba1987-3b76-28bc-5a24-b81f10f3a7a0',
'https://view.officeapps.live.com/op/view.aspx?src=https://c.s-microsoft.com/en-us/CMSFiles/MSFT_FY22Q4_10K.docx?version=401b18a2-7dfa-3105-fd3b-90c1a5ca04bb',
'https://c.s-microsoft.com/en-us/CMSFiles/MSFT_FY22Q4_10K.docx?version=401b18a2-7dfa-3105-fd3b-90c1a5ca04bb',
'https://view.officeapps.live.com/op/view.aspx?src=https://c.s-microsoft.com/en-us/CMSFiles/2022_Shareholder_Letter.docx?version=cb4b090f-97d2-ef69-0b8d-8d5b061d45aa',
'https://c.s-microsoft.com/en-us/CMSFiles/2022_Shareholder_Letter.docx?version=cb4b090f-97d2-ef69-0b8d-8d5b061d45aa',
'http://view.officeapps.live.com/op/view.aspx?src=https://c.s-microsoft.com/en-us/CMSFiles/2021_Annual_Report.docx?version=5290c17d-8858-c9ef-d16f-60e02f42214e',
'https://c.s-microsoft.com/en-us/CMSFiles/2021_Annual_Report.docx?version=5290c17d-8858-c9ef-d16f-60e02f42214e',
'https://view.officeapps.live.com/op/view.aspx?src=https://c.s-microsoft.com/en-us/CMSFiles/MSFT_FY21Q4_10K.docx?version=01062c71-0508-22e2-9fb8-9ecf51bb6378',
'https://c.s-microsoft.com/en-us/CMSFiles/MSFT_FY21Q4_10K.docx?version=01062c71-0508-22e2-9fb8-9ecf51bb6378',
'http://view.officeapps.live.com/op/view.aspx?src=https://c.s-microsoft.com/en-us/CMSFiles/2021_Shareholder_Letter.docx?version=2f3bdd2d-4568-e98a-645d-713ac2461978',
'https://c.s-microsoft.com/en-us/CMSFiles/2021_Shareholder_Letter.docx?version=2f3bdd2d-4568-e98a-645d-713ac2461978',
'https://view.officeapps.live.com/op/view.aspx?src=https://c.s-microsoft.com/en-us/CMSFiles/2020_Annual_Report.docx?version=8a3ca1db-2de7-c0e7-d7c5-176c412a395e',
'https://c.s-microsoft.com/en-us/CMSFiles/2020_Annual_Report.docx?version=8a3ca1db-2de7-c0e7-d7c5-176c412a395e',
'http://view.officeapps.live.com/op/view.aspx?src=https://c.s-microsoft.com/en-us/CMSFiles/MSFT_FY20Q4_10K.docx?version=71873a68-d431-e887-124f-4d24b9ade60c',
'https://c.s-microsoft.com/en-us/CMSFiles/MSFT_FY20Q4_10K.docx?version=71873a68-d431-e887-124f-4d24b9ade60c',
'https://view.officeapps.live.com/op/view.aspx?src=https://c.s-microsoft.com/en-us/CMSFiles/2020_Shareholder_Letter.docx?version=e783c2cc-537c-fdb7-5a0c-73a848250f05',
'https://c.s-microsoft.com/en-us/CMSFiles/2020_Shareholder_Letter.docx?version=e783c2cc-537c-fdb7-5a0c-73a848250f05',
'https://view.officeapps.live.com/op/view.aspx?src=https://c.s-microsoft.com/en-us/CMSFiles/MSFT_FY19Q4_10K.docx?version=0a785912-1d8b-1ee0-f8d8-63f2fb7a5f00',
'https://c.s-microsoft.com/en-us/CMSFiles/MSFT_FY19Q4_10K.docx?version=0a785912-1d8b-1ee0-f8d8-63f2fb7a5f00',
'https://view.officeapps.live.com/op/view.aspx?src=https://c.s-microsoft.com/en-us/CMSFiles/2019_Shareholder_Letter.docx?version=56169a49-efd1-27be-1777-6c36b3426da1',
'https://c.s-microsoft.com/en-us/CMSFiles/2019_Shareholder_Letter.docx?version=56169a49-efd1-27be-1777-6c36b3426da1',
'http://view.officeapps.live.com/op/view.aspx?src=https://c.s-microsoft.com/en-us/CMSFiles/2019_Proxy_Statement.docx?version=eae4affc-fe10-b796-c108-566255610e0f',
'http://view.officeapps.live.com/op/view.aspx?src=http://www.microsoft.com/investor/reports/ar14/docs/2014_Annual_Report.docx',
'http://view.officeapps.live.com/op/view.aspx?src=http://www.microsoft.com/investor/reports/ar14/docs/MSFT_FY14Q4_10K.docx',
'http://view.officeapps.live.com/op/view.aspx?src=http://www.microsoft.com/investor/reports/ar14/docs/2014_Shareholder_Letter.docx',
'http://www.microsoft.com/investor/Downloads/Investor Services/Information for Investors/2011_Proxy_Statement.docx',
'http://www.microsoft.com/investor/Downloads/Investor Services/Information for Investors/2010_Proxy_Statement.docx',
'http://cid-4910e8dd2e872bb2.office.live.com/view.aspx/FY2010/Microsoft%202010%2010K.docx',
'http://cid-4910e8dd2e872bb2.office.live.com/view.aspx/FY2010/Microsoft%202010%20Annual%20Report.docx',
'https://cid-4910e8dd2e872bb2.office.live.com/view.aspx/FY2010/2010%20Letter%20to%20Shareholders.docx',
'https://view.officeapps.live.com/op/view.aspx?src=https://c.s-microsoft.com/en-us/CMSFiles/2022_Annual_Report.docx?version=71ba1987-3b76-28bc-5a24-b81f10f3a7a0',
'https://c.s-microsoft.com/en-us/CMSFiles/2022_Annual_Report.docx?version=71ba1987-3b76-28bc-5a24-b81f10f3a7a0',
'https://view.officeapps.live.com/op/view.aspx?src=https://c.s-microsoft.com/en-us/CMSFiles/MSFT_FY22Q4_10K.docx?version=401b18a2-7dfa-3105-fd3b-90c1a5ca04bb',
'https://c.s-microsoft.com/en-us/CMSFiles/MSFT_FY22Q4_10K.docx?version=401b18a2-7dfa-3105-fd3b-90c1a5ca04bb',
'https://view.officeapps.live.com/op/view.aspx?src=https://c.s-microsoft.com/en-us/CMSFiles/2022_Shareholder_Letter.docx?version=cb4b090f-97d2-ef69-0b8d-8d5b061d45aa',
'https://c.s-microsoft.com/en-us/CMSFiles/2022_Shareholder_Letter.docx?version=cb4b090f-97d2-ef69-0b8d-8d5b061d45aa']
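Many of these entries are Office web viewer links of the form `.../view.aspx?src=<actual file URL>`. As far as I can tell, requesting such a link returns the viewer page and not the document itself, so it may be worth unwrapping them first. The following is only a sketch under that assumption; `VIEWER_MARKER`, `direct_docx_url` and `direct_docx_urls` are names I made up for illustration:

```python
# Assumption: viewer links embed the real file URL after this marker.
VIEWER_MARKER = "view.aspx?src="


def direct_docx_url(url):
    """Return the URL embedded in a viewer link, or url unchanged otherwise."""
    if VIEWER_MARKER in url:
        return url.split(VIEWER_MARKER, 1)[1]
    return url


def direct_docx_urls(urls):
    """Unwrap viewer links and drop repeats while keeping the order."""
    return list(dict.fromkeys(direct_docx_url(u) for u in urls))


# Two entries taken from the list above; in practice pass docx_files instead.
sample_docx = [
    "https://view.officeapps.live.com/op/view.aspx?src=https://c.s-microsoft.com/en-us/CMSFiles/2022_Annual_Report.docx?version=71ba1987-3b76-28bc-5a24-b81f10f3a7a0",
    "https://c.s-microsoft.com/en-us/CMSFiles/2022_Annual_Report.docx?version=71ba1987-3b76-28bc-5a24-b81f10f3a7a0",
]
print(direct_docx_urls(sample_docx))
```

The `cid-...office.live.com/view.aspx/...` links do not contain the marker, so they pass through unchanged; I have not checked whether they still resolve.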
The last step is to download these .docx files:
def download_file(url):
    # Drop any "?version=..." query string so it does not end up in the file name
    local_filename = url.split('/')[-1].split('?')[0]
    # To save into a directory of interest, prepend it, e.g. os.path.join("/dir/of/interest", local_filename)
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        # Stream the response to disk in chunks instead of loading it into memory
        with open(local_filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    return local_filename
for i in docx_files:
    download_file(i)
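If you want the files in a specific directory, the local path can be built separately from the download. This is a rough sketch, assuming the URL points directly at the file (not at a viewer page); `target_path` is a hypothetical helper and not part of the function above:

```python
from pathlib import Path
from urllib.parse import unquote, urlsplit


def target_path(url, directory="."):
    """Build a local path from the last path segment of a direct file URL."""
    # urlsplit separates the query string, so "?version=..." is not part of the name;
    # unquote turns escapes such as %20 back into spaces.
    name = unquote(urlsplit(url).path.rsplit("/", 1)[-1])
    return Path(directory) / name


example = target_path(
    "https://c.s-microsoft.com/en-us/CMSFiles/2022_Annual_Report.docx"
    "?version=71ba1987-3b76-28bc-5a24-b81f10f3a7a0",
    "reports",
)
print(example)
```

The directory itself still has to exist before writing, for example via `Path("reports").mkdir(exist_ok=True)`.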