When I make a GET request to this URL in a browser, a full HTML page is returned: http://www.waterwaysguide.org.au/waterwaysguide/access-point/4980/partial. However, when I make the GET request with the Python requests module, only part of the HTML is returned and the core content is missing.

How do I change my code so that I can get the data that is missing?

This is the code I am using:

import requests
def get_data(point_num):
    base_url = 'http://www.waterwaysguide.org.au/waterwaysguide/access-point/{}/partial'
    r = requests.get(base_url)
    html_content = r.text
    print(html_content)
get_data(4980)

The result of running the code is shown below. The content inside the div class="view view-waterway-access-point-page... is missing.

<div>
  <div class="modal-header">
    <button type="button" class="close" data-dismiss="modal" aria-label="Close">
      <span aria-hidden="true">&times;</span>
    </button>
    <h4 class="modal-title">
        Point of Interest detail    </h4>

  </div>
  <div class="modal-body">
    <div class="view view-waterway-access-point-page view-id-waterway_access_point_page view-display-id-page view-dom-id-c855bf9afdfe945979f96b2301d55784">
        
  
  
  
  
  
  
  
  
</div>  </div>
  <div class="modal-footer">
    
    <button type="button" id="closeRemoteModal" class="btn btn-action" data-dismiss="modal">Close</button>
  </div>
</div>
Andrew Houghton

3 Answers

The following approach displays the missing content inside the div class="view view-waterway-access-point-page...

>>> from urllib.request import Request, urlopen
>>> from bs4 import BeautifulSoup
>>> url = 'http://www.waterwaysguide.org.au/waterwaysguide/access-point/4980/partial'
>>> req = Request(url,headers={'User-Agent': 'Mozilla/5.0'})
>>> webpage = urlopen(req).read()
>>> print(webpage)
Ashok Kumar Jayaraman
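
Note that urlopen(req).read() returns bytes, so print(webpage) shows a b'...' literal with escape sequences rather than readable HTML. A minimal sketch of the same request that decodes the body to text is below. The helper names build_request and fetch_html, the 10 second timeout and the UTF-8 fallback are assumptions made for illustration, not part of the snippet above.

```python
from urllib.request import Request, urlopen

URL = 'http://www.waterwaysguide.org.au/waterwaysguide/access-point/4980/partial'


def build_request(url, user_agent='Mozilla/5.0'):
    """Build a GET request that sends a browser-like User-Agent header."""
    return Request(url, headers={'User-Agent': user_agent})


def fetch_html(url, timeout=10):
    """Fetch the page and return the body as text instead of bytes."""
    with urlopen(build_request(url), timeout=timeout) as response:
        # Use the charset declared by the server; falling back to UTF-8
        # is an assumption for pages that do not declare one.
        charset = response.headers.get_content_charset() or 'utf-8'
        return response.read().decode(charset, errors='replace')
```

Calling print(fetch_html(URL)) should then print HTML comparable to r.text from requests.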

I found my mistake. I never used the point_num argument that I pass to the function, so the '{}' placeholder in base_url was never filled in and my request was not going to the correct URL.

The code works now that I have changed the line to:

r = requests.get(base_url.format(point_num))
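
To make the effect of that change concrete, here is a minimal sketch of just the URL-building step. The names BASE_URL and build_url are illustrative and not part of the original code. Without .format() the literal '{}' stays in the URL, which would explain why the response contained an empty view.

```python
BASE_URL = 'http://www.waterwaysguide.org.au/waterwaysguide/access-point/{}/partial'


def build_url(point_num):
    """Fill the '{}' placeholder in BASE_URL with the access point number."""
    return BASE_URL.format(point_num)


# The unformatted template still contains the literal placeholder:
print(BASE_URL)
# http://www.waterwaysguide.org.au/waterwaysguide/access-point/{}/partial

# After formatting, the URL points at a specific access point:
print(build_url(4980))
# http://www.waterwaysguide.org.au/waterwaysguide/access-point/4980/partial
```

Passing build_url(point_num) to requests.get then requests the intended page.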
Andrew Houghton

It may be that those elements are rendered by JavaScript after the page has loaded. In that case requests only receives the initial HTML, not the JavaScript-rendered parts.
You might want to look into:

https://medium.com/@hoppy/how-to-test-or-scrape-javascript-rendered-websites-with-python-selenium-a-beginner-step-by-c137892216aa

Web-scraping JavaScript page with Python
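
Before reaching for a browser-driving tool, it may be worth checking whether the static HTML already contains the content. The sketch below uses only the standard library to test whether the target div holds any text. The class name is taken from the output in the question, while ViewTextCollector and view_has_content are hypothetical helper names. If the div is still empty for the correct URL, JavaScript rendering becomes a more likely explanation.

```python
from html.parser import HTMLParser


class ViewTextCollector(HTMLParser):
    """Collect text found inside any <div> whose class list contains target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0    # <div> nesting depth inside the target; 0 means outside
        self.chunks = []  # non-blank text fragments found inside the target

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        if self.depth > 0:
            # A nested <div> inside the target: track it so the matching
            # </div> does not end collection too early.
            self.depth += 1
        elif self.target_class in (dict(attrs).get('class') or '').split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def view_has_content(html, target_class='view-waterway-access-point-page'):
    """Return True if the target div contains any non-whitespace text."""
    parser = ViewTextCollector(target_class)
    parser.feed(html)
    parser.close()
    return bool(parser.chunks)
```

For example, view_has_content(r.text) could be called on the response text from requests to decide whether the static page is enough.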

Anuj Gautam