
I want to download the Excel file on this ONS webpage using the MechanicalSoup package in Python. I have read the MechanicalSoup documentation. I have searched extensively for an example to follow, on StackOverflow and elsewhere, without luck.

My attempt is:

# Install dependencies
# pip install requests
# pip install BeautifulSoup4
# pip install MechanicalSoup

# Import libraries
import mechanicalsoup
import urllib.request
import requests
from bs4 import BeautifulSoup

# Create a browser object that can collect cookies
browser = mechanicalsoup.StatefulBrowser()

browser.open("https://www.ons.gov.uk/economy/grossdomesticproductgdp/timeseries/l2kq/qna")

browser.download_link("https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna")

In that last line, I have also tried:

browser.download_link(link="https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna",file="c:/test/filename.xls")

Update 25 Jan 2019: And thanks to AKX's comment below, I've tried

browser.download_link(re.escape("https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna"))

In each case, I get the error:

mechanicalsoup.utils.LinkNotFoundError

Yet the link does exist. Try pasting this into your address bar to confirm:

https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna

What am I doing wrong?

Update 2, 25 Jan 2019: Thanks to AKX's answers below, this is the full MWE that answers my question (posting for anyone who encounters the same difficulty later):

# Install dependencies
# pip install requests
# pip install BeautifulSoup4
# pip install MechanicalSoup

# Import libraries (only mechanicalsoup is used by the code below)
import mechanicalsoup

# Create a browser object that can collect cookies
browser = mechanicalsoup.StatefulBrowser()

browser.open("https://www.ons.gov.uk/economy/grossdomesticproductgdp/timeseries/l2kq/qna")

# Find the link whose visible text is ".xls" on the current page and save its target
browser.download_link(link_text=".xls", file="c:/py/ONS_Data.xls")
Ian

2 Answers


I haven't used MechanicalSoup, but the docs for download_link() say:

This function behaves similarly to follow_link()

and follow_link says (emphasis mine)

  • If link is a bs4.element.Tag (i.e. from a previous call to links() or find_link()), then follow the link.
  • If link doesn’t have a href-attribute or is None, treat link as a url_regex and look it up with find_link(). Any additional arguments specified are forwarded to this function.

Question marks (among other things) are regular expression (regex) metacharacters, so you'll want to escape them if you want to use them for follow_link/download_link:

import re
# ...
browser.download_link(re.escape("https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna"))
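
To see why the escaping matters, here is a minimal sketch using only the standard `re` module. It searches the URL string against itself, first as a raw pattern and then as an escaped one. It says nothing about what the ONS page actually contains; it only illustrates the metacharacter problem.

```python
import re

url = ("https://www.ons.gov.uk/generator?format=xls"
       "&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna")

# Unescaped: "r?" means "an optional r", so the pattern looks for
# "generato" or "generator" immediately followed by "format".
# The literal "?" in the text is never accounted for.
raw_match = re.search(url, url)

# Escaped: every metacharacter ("?", ".") is matched literally.
escaped_match = re.search(re.escape(url), url)

print("raw pattern matches its own text:    ", raw_match is not None)
print("escaped pattern matches its own text:", escaped_match is not None)
```

The raw pattern fails even against the exact string it was built from, which is why an unescaped URL is a poor `url_regex`.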

However, if the first page you visit doesn't contain that direct link, I'm not sure it'll help anyway. (Do try first though.)

Alternatively, you might be able to download the file directly through the browser's underlying requests session, which probably holds the cookie jar (assuming some cookies are required for the download):

resp = browser.session.get("https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna")
resp.raise_for_status()  # raise an exception for 404, etc.
with open('filename.xls', 'wb') as outf:
    outf.write(resp.content)
AKX
  • Your second suggestion (resp = browser.session.get) worked, thank you for showing it to me. Your first suggestion (re.escape) resulted in the error "TypeError: escape() got an unexpected keyword argument 'file'". Do you know why? – Ian Jan 24 '19 at 17:36
  • You probably did `download_link(re.escape(..., file=...))`, not `download_link(re.escape(...), file=...)`, i.e. you passed the `file=` that should have gone to `download_link`, to the `re.escape` function, which only takes one parameter. – AKX Jan 24 '19 at 22:29
  • Yes that's right, thank you for helping me fix my mistake! Unfortunately, after removing the `file=` bit, I now get the original error `mechanicalsoup.utils.LinkNotFoundError`. Is there anything else I could try to do this with MechanicalSoup's `browser.download_link()` command? – Ian Jan 25 '19 at 09:28
  • `browser.download_link()` works like `follow_link()`, i.e. (as far as I can see) it tries to look for a link that has the given text on the page the virtual browser is currently on. Since the label for that download link is ".xls", try `.download_link(link_text=".xls")` maybe? – AKX Jan 25 '19 at 11:59
  • AKX, you are an unrivaled genius. Thank you. This slight modification of your suggestion above worked: `browser.download_link(link_text=".xls",file="c:/py/ONS_Data.xls" )` – Ian Jan 25 '19 at 13:14

You are confusing a link (an element in a webpage, like <a href=... >) with a URL (a string of the form http://example.com). MechanicalSoup's follow_link looks for a link in the current page and follows it, as if you had clicked on it in your browser.
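
The distinction can be sketched with the standard library alone. The HTML fragment below is hypothetical (the real ONS markup may differ, and the relative `href` values are an assumption), and `find_by_text` / `find_by_url_regex` are illustrative stand-ins, not MechanicalSoup's implementation. The point is only that a lookup works on the `<a>` elements present in the page, by their text or by a regex on their `href`.

```python
import re
from html.parser import HTMLParser

# Hypothetical page fragment; relative hrefs are an assumption.
SAMPLE_HTML = """
<p>Download this time series:
<a href="/generator?format=csv&amp;uri=/economy/x">.csv</a>
<a href="/generator?format=xls&amp;uri=/economy/x">.xls</a>
</p>
"""


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def find_by_text(links, link_text):
    """Return hrefs of links whose visible text equals link_text."""
    return [href for href, text in links if text == link_text]


def find_by_url_regex(links, url_regex):
    """Return hrefs of links whose href matches url_regex."""
    return [href for href, _ in links if re.search(url_regex, href)]


parser = LinkCollector()
parser.feed(SAMPLE_HTML)
parser.close()

absolute = "https://www.ons.gov.uk/generator?format=xls&uri=/economy/x"
by_text = find_by_text(parser.links, ".xls")
by_absolute_url = find_by_url_regex(parser.links, re.escape(absolute))
by_partial_regex = find_by_url_regex(parser.links, r"format=xls")

print("by text:         ", by_text)
print("by absolute URL: ", by_absolute_url)
print("by partial regex:", by_partial_regex)
```

In this sketch the absolute URL finds nothing even when escaped, because no `href` in the fragment contains it, while the link text and a partial regex both locate the element. If the real page uses relative hrefs, that would be one possible explanation for a "link not found" error on a URL that works in the address bar.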

  • Perhaps I am confused. If so, could you please tell me what I should put inside `download_link()`? (Also, correct me if I'm wrong, but I believe you could put `"https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna"` inside an `<a>` tag, and that would allow a user to follow the link by clicking on it in their browser.) – Ian Jan 25 '19 at 09:37