I'm using Python's requests module to scrape this website: http://reports.ieso.ca/public/Adequacy2/PUB_Adequacy2_20200114.xml

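import requests

def get_info(date="20200114"):
    # Build the report URL from a YYYYMMDD date string.
    url = f"http://reports.ieso.ca/public/Adequacy2/PUB_Adequacy2_{date}.xml"

    # No Content-Type header: it describes a request body, and a GET has none.
    # verify=False is dropped too, since it only affects HTTPS certificate checks.
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    print(response.text)
    return response

get_info()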

Now it returns XML, which I understand. But the HTML I see when I inspect that website is different, and much better in its structure.

Is there a way to get that data with requests instead of the XML data? Or other alternatives?

0m3r
Amon
  • I think the website has some JS code turning that API response (XML) into the HTML you're seeing in your browser. – arnaud Apr 14 '20 at 21:30
  • Yea I see some `xsl` tags in the code actually, would that be it? Is there no way to retrieve the final result I see in the browser? – Amon Apr 14 '20 at 21:50
  • Does this answer your question? [Web-scraping JavaScript page with Python](https://stackoverflow.com/questions/8049520/web-scraping-javascript-page-with-python) – AMC Apr 15 '20 at 00:58
  • _But the HTML structure I see when I inspect that website is different, and much better in its structure._ In what way is it better? Usually direct access to the data is far more desirable than having to parse a bunch of HTML. The XML seems just fine to me. – AMC Apr 15 '20 at 01:01
  • Really? It's arranged in tables with rows in the HTML. Seems much more intuitive – Amon Apr 15 '20 at 02:21
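
For what it's worth, the XML can be turned into table-like rows without going through HTML at all. The following is only a minimal sketch using the standard library's `xml.etree.ElementTree`. The sample document, its namespace, and the element names (`Row`, `Hour`, `Demand`) are made up for illustration and are not the real IESO schema, and `local_name` and `rows_from_xml` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: these element names and this namespace are invented
# for illustration and are NOT the real IESO Adequacy2 schema.
SAMPLE_XML = """<Document xmlns="http://example.com/ns">
  <Row><Hour>1</Hour><Demand>15000</Demand></Row>
  <Row><Hour>2</Hour><Demand>14500</Demand></Row>
</Document>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def rows_from_xml(xml_text, row_tag):
    """Return one dict per element named row_tag, keyed by child tag name."""
    root = ET.fromstring(xml_text)
    rows = []
    for elem in root.iter():
        if local_name(elem.tag) == row_tag:
            rows.append(
                {local_name(child.tag): (child.text or "").strip() for child in elem}
            )
    return rows


if __name__ == "__main__":
    for row in rows_from_xml(SAMPLE_XML, "Row"):
        print(row)
```

With the real report, `response.text` from the request in the question would take the place of `SAMPLE_XML`, and `row_tag` would be whichever repeating element the actual document uses.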

1 Answer

I think Beautiful Soup might do what you are asking.

Install Beautiful Soup:

pip3 install beautifulsoup4

The `soup` object will hopefully parse to what you are expecting:

import requests
from bs4 import BeautifulSoup

URL = 'https://www.monster.com/jobs/search/?q=Software-Developer&where=Australia'
page = requests.get(URL)

soup = BeautifulSoup(page.content, 'html.parser')
Maxqueue
  • Yes this came to mind, I wasn't sure because the source seems to be only XML. This or Selenium, but if it's possible in these why not in `requests`? What are they doing different? – Amon Apr 14 '20 at 20:55
  • Ah, sorry, I wasn't sure if this would work. Like you said, it was the first thing that came to mind. – Maxqueue Apr 14 '20 at 21:01
  • I actually found another URL by inspecting: http://reports.ieso.ca/docrefs/stylesheet/Adequacy2_HTML_t1-3.xsl But it gives me some XSL markup? I have no idea where the browser gets the HTML. – Amon Apr 14 '20 at 21:07
  • BeautifulSoup is just an HTML parser, why would it make a difference? – AMC Apr 15 '20 at 00:58
  • @AMC Because html is not the same as xml – Maxqueue Apr 15 '20 at 02:31
  • @Maxqueue Right, but if the data returned by the request isn’t what we need, then parsing that same data won’t change much. – AMC Apr 15 '20 at 02:32
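
On the `.xsl` URL mentioned in the comments: an XSL file is a stylesheet that a browser applies to the XML to produce the HTML it displays, so `requests` only ever receives the raw XML. The same transformation can probably be reproduced in Python. This is a rough sketch using the third-party `lxml` package (`pip3 install lxml`). The sample XML and stylesheet are tiny hypothetical stand-ins, not the real IESO files, and `xml_to_html` is a made-up helper name.

```python
from lxml import etree

# Hypothetical stand-ins for the real report and its stylesheet.
SAMPLE_XML = b"""<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="style.xsl"?>
<Report><Item><Name>A</Name><Value>1</Value></Item></Report>"""

SAMPLE_XSL = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>
  <xsl:template match="/Report">
    <table>
      <xsl:for-each select="Item">
        <tr>
          <td><xsl:value-of select="Name"/></td>
          <td><xsl:value-of select="Value"/></td>
        </tr>
      </xsl:for-each>
    </table>
  </xsl:template>
</xsl:stylesheet>"""


def xml_to_html(xml_bytes, xsl_bytes):
    """Apply an XSLT stylesheet to an XML document and return the result as text."""
    transform = etree.XSLT(etree.fromstring(xsl_bytes))
    result = transform(etree.fromstring(xml_bytes))
    return str(result)


if __name__ == "__main__":
    print(xml_to_html(SAMPLE_XML, SAMPLE_XSL))
```

For the real data, one would fetch both the report URL and the stylesheet URL with `requests.get(...).content` and pass the two byte strings to `xml_to_html`. This has not been tested against the actual IESO stylesheet, which may import other files or rely on features that `lxml` (XSLT 1.0 only) does not support. The resulting HTML could then be handed to Beautiful Soup.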