I want to scrape https://wiki.openstreetmap.org/wiki/Key:office, specifically the table containing all the tags, i.e. everything inside:
<table class="wikitable taginfo-taglist">...</table>
Since everything inside:
<div class="taglist" ...> ... </div>
(the parent of the table) is generated by JavaScript, I thought this code would work:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")
caps = webdriver.DesiredCapabilities().FIREFOX
caps["marionette"] = True
driver = webdriver.Firefox(options=options, capabilities=caps, executable_path='../statics/geckodriver')


def get_tag_soup(url):
    driver.get(url)
    try:
        # Wait up to 10 seconds for the JavaScript-generated table to appear
        table = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "wikitable taginfo-taglist")))
        soup = BeautifulSoup(table.get_attribute('innerHTML'), 'lxml')
    except Exception as e:
        soup = e
    return soup


get_tag_soup('https://wiki.openstreetmap.org/wiki/Key:office')
But when I run this code I just get a selenium.common.exceptions.TimeoutException('', None, None).
More frustratingly, if I instead wait for the parent of "wikitable taginfo-taglist"
with EC.presence_of_element_located((By.CLASS_NAME, "taglist")),
it sometimes works.
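One variant I have not ruled out, shown here only as a rough sketch: as far as I understand, By.CLASS_NAME expects a single class name, so a value with a space in it such as "wikitable taginfo-taglist" may never match, which would explain why the single-class "taglist" lookup behaves differently. Under that assumption, the two classes could be combined into a CSS selector and passed with By.CSS_SELECTOR instead. The helper names `compound_class_to_css` and `wait_for_tag_table` are hypothetical names made up for this illustration; they are not part of Selenium.

```python
def compound_class_to_css(class_attr, tag=""):
    """Turn a space-separated class attribute into a CSS selector.

    Hypothetical helper for illustration; not part of Selenium.
    """
    classes = class_attr.split()
    if not classes:
        raise ValueError("expected at least one class name")
    # Each class becomes ".name"; chaining them requires all classes on one element.
    return tag + "".join("." + name for name in classes)


def wait_for_tag_table(driver, timeout=10):
    """Wait for the tag table using a CSS selector instead of By.CLASS_NAME.

    Assumes `driver` is an already created Selenium WebDriver that has
    loaded the wiki page.
    """
    # Imported lazily so the helper above can be used without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    selector = compound_class_to_css("wikitable taginfo-taglist", tag="table")
    return WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, selector))
    )
```

With the driver from my code above, this would be called as `table = wait_for_tag_table(driver)` after `driver.get(url)`. It only addresses the locator; if the table is simply slow to be generated, the timeout may still need to be raised.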