I want to scrape a city GIS website for the names of the projects currently active in the town of Brighton: https://brighton.maps.arcgis.com/apps/webappviewer/index.html?id=2e3dacc6615e4cf59b6db043cc3f12cc

However, I can't get past the initial Terms & Agreements checkbox. I'm still new to web scraping, so I'm not sure where to begin with this one beyond the usual imports and request:

import requests
from bs4 import BeautifulSoup

URL = "https://brighton.maps.arcgis.com/apps/webappviewer/index.html?id=2e3dacc6615e4cf59b6db043cc3f12cc"
content = requests.get(URL)
soup = BeautifulSoup(content.text, "lxml")

I tried to follow this question: How to bypass Terms and Conditions agreement with Beautiful Soup. However, that is a totally different scenario. I feel confident I'll be able to figure out the scraping portion; it's just the "Terms and Agreements" prompt I can't get past. Please help, I'm desperate!

Bjohnk
  • The header from your linked SO post is important – Daraan Oct 18 '22 at 21:59
  • BeautifulSoup is a parser. Trying to interact with a T&C prompt in BeautifulSoup is like trying to walk through the front door of the blueprints for a building. – user2357112 Oct 18 '22 at 22:05
  • The HTML is generated by JavaScript, so you are going to have a hard time finding it with a simple HTTP request. You can either find the underlying API behind the website, or use something like Selenium to get the rendered HTML and then parse it with bs4. – toppk Oct 18 '22 at 22:06
  • Ah understood, alright, this is very helpful actually. Thanks for going easy on me for the noob question haha! – Bjohnk Oct 18 '22 at 22:26

1 Answer

There is no need to bypass the checkbox, since what you are actually interested in is the content behind it.

You can right-click the page, select Inspect, and then open the Network tab. There you can see all the requests your browser sends to load the page. As you can see, it's quite a lot. If you are using requests, you have to mimic this behavior yourself. It seems like the data you are looking for is actually loaded from a different URL:

r = requests.get("https://brighton.maps.arcgis.com/sharing/rest/content/items/2e3dacc6615e4cf59b6db043cc3f12cc/data?f=json").json()

This way you get a dictionary that you can work with. What information exactly are you interested in?
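Since I don't know what that dictionary looks like for your item, here is a minimal sketch of one way to explore it: walk the nested structure and list every string value that looks like a URL, because the feature data may again be loaded from one of those. The `sample` dictionary and its keys are made up for illustration and are not the real response.

```python
import re

# Matches strings that start with http:// or https://
URL_PATTERN = re.compile(r"^https?://", re.IGNORECASE)


def find_urls(node, path=""):
    """Yield (path, url) pairs for every URL-like string in nested JSON data."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_urls(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_urls(value, f"{path}[{index}]")
    elif isinstance(node, str) and URL_PATTERN.match(node):
        yield path, node


# Hypothetical stand-in for the parsed JSON; the real keys will differ.
sample = {
    "title": "demo",
    "links": [{"href": "https://example.com/a"}],
    "nested": {"service": "http://example.com/b", "count": 3},
}

for path, url in find_urls(sample):
    print(path, "->", url)
```

With the real data you would call `find_urls(r)` on the dictionary from the request above and then look at the reported paths in the Network tab to see which of them the page actually queries.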

An alternative to requests is the selenium package, which controls a real browser and lets you click on elements, tick checkboxes, and so on.
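A rough sketch of the Selenium route, assuming Chrome is installed and that you have looked up the CSS selectors for the checkbox and the confirm button with the Inspect tool. I have not checked them for this page, so `checkbox_css` and `confirm_css` are placeholders you have to supply, and the function name is my own.

```python
def html_after_accepting(url, checkbox_css, confirm_css, timeout=30):
    """Open url in Chrome, tick the checkbox, confirm, and return the page HTML.

    checkbox_css and confirm_css are caller-supplied CSS selectors; the right
    values for a given page are not known here.
    """
    # Imported inside the function so this file loads without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        wait = WebDriverWait(driver, timeout)
        # Wait until each element is clickable, then click it.
        wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, checkbox_css))).click()
        wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, confirm_css))).click()
        return driver.page_source
    finally:
        # Always close the browser, even if a wait times out.
        driver.quit()
```

The returned string can be passed to `BeautifulSoup(html, "lxml")` as in the question. The map content may still be loading right after the click, so you will probably need one more wait for an element that only exists once the map is shown.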

bitflip
  • I'm looking to create a list of project names for each of the symbols shown on the map. A description pops up when you click on any of the symbols, and I can trace it with the "inspect" tool. Unfortunately, I don't have access to these elements unless I can bypass the "Terms and Conditions" prompt. The HTML elements look different before and after you accept this prompt. For example, there are no HTML tags when you inspect the page before accepting the "Terms and Conditions" prompt. – Bjohnk Oct 18 '22 at 22:16
  • If you're not already doing so, you need to observe the interactions the web page makes with your browser when you accept the Terms and Conditions, using a tool such as Telerik Fiddler or Wireshark. Once you've worked out the behavior, you might be able to mimic it in your scraper. Be aware that the website may have put measures in place that make it difficult or impossible to defeat. – Robert Harvey Oct 18 '22 at 22:19
  • Gotcha, thanks for clarifying. I'm still new to this so this makes sense and clarifies BitFlip's answer as well. Thanks again for the answer! – Bjohnk Oct 18 '22 at 22:29