
I am currently attempting to scrape this website https://schedule.townsville-port.com.au/

I would like to scrape the text in all the individual tooltips.

Here is what the HTML for a typical element I have to hover over looks like:

<div event_id="55591" class="dhx_cal_event_line past_event" style="position:absolute; top:2px; height: 42px; left:1px; width:750px;"><div> 

Here is what the typical HTML for the tooltip looks like:

<div class="dhtmlXTooltip tooltip" style="visibility: visible; left: 803px; bottom:74px;

I have tried various approaches, such as attempting to scrape the tooltips directly, and also hovering over the elements and then scraping the resulting HTML.

tool_tips=driver.find_elements_by_class_name("dhx_cal_event_line past_event")

tool_tips=driver.find_elements_by_xpath("//div[@class=dhx_cal_event_line past_event]")

tool_tips=driver.find_element_by_css_selector("dhx_cal_event_line past_event")

I have also attempted the same code with "dhtmlXTooltip tooltip" instead of "dhx_cal_event_line past_event"

I really don't understand why

tool_tips=driver.find_elements_by_class_name("dhx_cal_event_line past_event")

doesn't work.

Can BeautifulSoup be used to tackle this, given that the HTML is dynamic and changing?

Jay Haran
  • You need to implement [ActionChains](http://selenium-python.readthedocs.io/api.html#module-selenium.webdriver.common.action_chains). `find_elements_by_class_name("dhx_cal_event_line past_event")` doesn't work because compound class names are not permitted. Also the correct CSS selector is `find_elements_by_css_selector(".dhx_cal_event_line.past_event")` – Andersson Apr 20 '18 at 10:29
  • If at all considering `Beautifulsoup`, why are you not tagging `Beautifulsoup` when you have tagged `Selenium`? – undetected Selenium Apr 20 '18 at 10:30
  • Thank you I have tagged it now. – Jay Haran Apr 20 '18 at 10:35
  • @Andersson reading through the documentation for ActionChains, it seems I still need to find the elements first, so how can I find them? I tried `find_elements_by_css_selector(".dhx_cal_event_line.past_event")` but that also doesn't find the element, and I don't even get a 'No such element' exception. – Jay Haran Apr 20 '18 at 11:08
  • that is because those elements are generated dynamically, so you also need to implement a [Wait](http://selenium-python.readthedocs.io/waits.html) – Andersson Apr 20 '18 at 11:18
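
Putting those comments together, a minimal sketch of the Selenium route might look like this (assuming Chrome, the old-style Selenium 3 API used in the question, and that .dhtmlXTooltip.tooltip is the right selector for the rendered tooltip):

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://schedule.townsville-port.com.au/')
wait = WebDriverWait(driver, 20)

# wait for the dynamically generated event bars; compound class names need a CSS selector
events = wait.until(EC.presence_of_all_elements_located(
    (By.CSS_SELECTOR, '.dhx_cal_event_line.past_event')))

for event in events:
    # hover over the event bar so its tooltip gets rendered
    ActionChains(driver).move_to_element(event).perform()
    tooltip = wait.until(EC.visibility_of_element_located(
        (By.CSS_SELECTOR, '.dhtmlXTooltip.tooltip')))
    print(tooltip.text)

driver.quit()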

1 Answer


If you open the Network tab in Chrome DevTools and filter by XHR, you can see that the website makes a request to http://schedule.townsville-port.com.au/spotschedule.php.

from bs4 import BeautifulSoup
import requests

url = 'http://schedule.townsville-port.com.au/spotschedule.php'
r = requests.get(url, verify=False)  # verify=False skips SSL certificate verification (see note below)
soup = BeautifulSoup(r.text, 'xml')  # the 'xml' parser requires lxml to be installed

transports = {}
events = soup.find_all('event')

for e in events:
    transport_id = e['id']
    # map each child tag name to its text, e.g. 'text' -> 'ARTANIA', 'IMO' -> '8201480'
    transport = {child.name: child.text for child in e.children}
    transports[transport_id] = transport

import pprint
pprint.pprint(transports)

Output:

{'48165': {'IMO': '8201480',
       'app_rec': 'Approved',
       'cargo': 'Passenger Vessel (Import)',
       'details': 'Inchcape Shipping Services Pty Limited',
       'duration': '8',
       'end_date': '2018-02-17 14:03:00.000',
       'sectionID': '10',
       'start_date': '2018-02-17 06:44:00.000',
       'text': 'ARTANIA',
       'visit_id': '19109'},
 ...
}

The only way I found to get rid of the SSLError was to disable certificate verification with verify=False; you can read more about it here.
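
Note that verify=False makes urllib3 emit an InsecureRequestWarning on every request; a minimal way to silence it (using the urllib3 package that requests itself depends on) is:

import urllib3

# suppress the InsecureRequestWarning triggered by verify=False
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)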

Notice that start_date and end_date are UTC times, so you can either specify the timeshift query param:

import time

utc_offset = -time.localtime().tm_gmtoff // 60  # in minutes    
url = f'http://schedule.townsville-port.com.au/spotschedule.php?timeshift={utc_offset}'

or convert dates and store them as datetime objects (you can read about converting time from UTC to your local timezone here).
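
For the second option, here is a minimal sketch of the conversion, using one of the start_date values from the output above (the format string is inferred from that output):

from datetime import datetime, timezone

# parse the UTC timestamp string returned by spotschedule.php
start = datetime.strptime('2018-02-17 06:44:00.000', '%Y-%m-%d %H:%M:%S.%f')
# attach the UTC timezone, then convert to the machine's local timezone
start_local = start.replace(tzinfo=timezone.utc).astimezone(tz=None)
print(start_local)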

radzak