0

I am trying to scrape data from this webpage: marine traffic

I tried plain scraping in Python and also Selenium, but I can't find any of the target data (latitude/longitude/speed) in what I get back.

Is there a special format that I am missing?

This is the code I started with

from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless') 
driver = webdriver.Chrome("C:/webdrivers/chromedriver.exe", options=options)
page = driver.page_source

But a simple text search (CTRL + F) through the page source turns up nothing useful.

Any idea how to scrape it?

Thanks

M-Wane

3 Answers

5

If you view the page in a browser, and log your browser's network traffic, you'll notice some XHR HTTP GET requests being made to various API endpoints, the response of which is JSON and contains the information you're looking for. All you have to do is imitate those requests - no BeautifulSoup or Selenium required:

def get_ship_position(ship_id):
    import requests

    url = "https://www.marinetraffic.com/vesselDetails/latestPosition/shipid:{}".format(ship_id)

    headers = {
        "accept": "application/json",
        "accept-encoding": "gzip, deflate",
        "user-agent": "Mozilla/5.0",
        "x-requested-with": "XMLHttpRequest"
    }

    response = requests.get(url, headers=headers)
    response.raise_for_status()

    return response.json()


def main():

    from datetime import datetime

    data = get_ship_position("371441")
    ts = datetime.utcfromtimestamp(data["lastPos"])
    print("Last known position: {} / {} @ {}".format(data["lat"], data["lon"], ts))
    
    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())

Output:

Last known position: -1.53057 / -48.77838 @ 2021-08-04 10:33:33
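
As a follow-up to the snippet above, here is a minimal sketch of turning the decoded JSON into a typed record with a timezone-aware timestamp. It only reads the three keys used above ("lat", "lon", "lastPos"). The names `ShipPosition` and `parse_position` are hypothetical, and the sample epoch value is an illustrative one chosen to match the printed time, not captured from the API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ShipPosition:
    latitude: float
    longitude: float
    timestamp: datetime  # timezone-aware, UTC


def parse_position(payload):
    """Convert the decoded JSON dict into a typed record.

    Only the keys used in the answer ("lat", "lon", "lastPos") are read.
    """
    return ShipPosition(
        # float() also accepts numeric strings, in case the API quotes its numbers.
        latitude=float(payload["lat"]),
        longitude=float(payload["lon"]),
        # "lastPos" is treated as Unix epoch seconds, as in the answer's code.
        timestamp=datetime.fromtimestamp(int(payload["lastPos"]), tz=timezone.utc),
    )


# Illustrative payload (assumption): same shape as the response used above.
sample = {"lat": -1.53057, "lon": -48.77838, "lastPos": 1628073213}
position = parse_position(sample)
print(position.latitude, position.longitude, position.timestamp.isoformat())
```

In practice you would pass the return value of `get_ship_position(...)` to `parse_position` instead of the sample dict.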
Paul M.
  • Wow man thanks a bunch very powerful technique. I am not familiar with XHR, is there a good tutorial to master it you could share ? – M-Wane Aug 04 '21 at 18:09
  • @M-Wane Take a look at [this other answer](https://stackoverflow.com/questions/61049188/how-do-i-get-this-information-out-of-this-website/61051360#61051360) I posted, where I go more in-depth into logging your network traffic, finding REST API endpoints and mimicking requests. – Paul M. Aug 04 '21 at 19:50
  • @M-Wane other than that, just learn about the Developer Tools of your browser (like Google Chrome's Devtools). Learn how to log your network traffic, learn about the HTTP protocol and how modern websites use JavaScript to populate the DOM asynchronously. – Paul M. Aug 04 '21 at 20:15
0

First, when using Selenium in headless mode, you have to define the window size:

options.add_argument('--window-size=1920,1080')

To get the coordinates and speed you can use this:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 20)

coordinates = wait.until(EC.visibility_of_element_located((By.XPATH, "//p[contains(text(),'Latitude')]/b"))).text

speed =  wait.until(EC.visibility_of_element_located((By.XPATH, "//p[contains(text(),'Speed')]/b"))).text

Also, since you are using headless mode, these settings may be useful:

options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
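
The flags from this answer can be collected in one place, roughly as sketched below. The helper names `headless_chrome_arguments` and `fetch_page_source` are hypothetical. The sketch assumes a Selenium version that can locate chromedriver on its own (or one already on PATH), and it includes the `driver.get()` call that the question's code is missing.

```python
def headless_chrome_arguments(width=1920, height=1080):
    """Return the Chrome command-line flags suggested in this answer."""
    return [
        "--headless",
        f"--window-size={width},{height}",
        "--no-sandbox",
        "--disable-dev-shm-usage",
    ]


def fetch_page_source(url):
    """Open url in headless Chrome and return the HTML seen right after loading."""
    # Imported lazily so the flag helper above works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in headless_chrome_arguments():
        options.add_argument(argument)

    driver = webdriver.Chrome(options=options)
    try:
        # Without this call the browser never navigates to the target page.
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()
```

Because the position data is filled in by JavaScript after the initial load, `page_source` taken immediately may still lack it; the explicit waits shown above remain necessary before reading the values.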
Prophet
  • Thanks a lot for your quick answer. Sadly the code returns an error. "selenium.common.exceptions.TimeoutException: Message:" Any idea on what could be the matter ? – M-Wane Aug 04 '21 at 17:35
  • it does for this line : coordinates = wait.until(EC.visibility_of_element_located((By.XPATH, "//p[contains(text(),'Latitude')]/b"))).text – M-Wane Aug 04 '21 at 17:59
  • Not instead of it, just to debug, can you try `wait.until(EC.presence_of_element_located((By.XPATH, "//p[contains(text(),'Latitude')]/b")))` ? – Prophet Aug 04 '21 at 18:03
  • Sorry but same problem – M-Wane Aug 04 '21 at 18:07
  • let's remove `options.add_argument('--headless')` to make it run in normal mode. Do you see the page opened normally? Do you have a command `driver.get("https://www.marinetraffic.com/en/ais/details/ships/shipid:371441/mmsi:310554000/imo:9312456/vessel:STENA_PERROS")` at all? – Prophet Aug 04 '21 at 18:21
0

There are a few things to handle:

  1. You need to click the Accept cookies button.
  2. You need to click the X button, which is only sometimes visible.
  3. You need explicit waits as well.

Sample code :

options = webdriver.ChromeOptions()
options.add_argument("--disable-infobars")
options.add_argument("--start-maximized")
options.add_argument("--disable-extensions")
# A second add_experimental_option("prefs", ...) call would replace the first,
# so both preferences are set in a single dict.
options.add_experimental_option("prefs", {
    "profile.default_content_setting_values.notifications": 2,
    "profile.default_content_settings.cookies": 2,
})
options.add_argument('--window-size=1920,1080')
options.add_argument("--headless")
driver = webdriver.Chrome(options = options)
driver.implicitly_wait(30)
driver.maximize_window()
driver.get("https://www.marinetraffic.com/en/ais/details/ships/shipid:371441/mmsi:310554000/imo:9312456/vessel:STENA_PERROS")
wait = WebDriverWait(driver, 20)
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[aria-label='AGREE']"))).click()
try:
    if(len(driver.find_elements(By.XPATH, "//*[name()='svg' and @class='MuiSvgIcon-root']/ancestor::button[contains(@class,'jss17')]"))) >0:
        print("X is visible")
        wait.until(EC.visibility_of_element_located((By.XPATH, "//*[name()='svg' and @class='MuiSvgIcon-root']/ancestor::button[contains(@class,'jss17')]"))).click()
        print("done clicking")
    else:
        print("X was not visible")
except Exception as exc:
    # Report the actual error instead of silently swallowing everything.
    print("something went wrong:", exc)

print(wait.until(EC.visibility_of_element_located((By.XPATH, "//b//a[contains(@href,'/en/ais/hom')]"))).text)
print(wait.until(EC.visibility_of_element_located((By.XPATH, "//b//a[contains(@href,'/en/ais/hom')]/ancestor::p/following-sibling::p/b"))).text)
print(wait.until(EC.visibility_of_element_located((By.XPATH, "//b//a[contains(@href,'/en/ais/hom')]/ancestor::p/following-sibling::p[2]/b"))).text)

Imports :

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

Output :

X is visible
done clicking
-1.53057° / -48.77838°
Underway using Engine
1.7 kn / 250 °

Process finished with exit code 0
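
The scraped values are display strings, so they still need converting before you can compute with them. A minimal sketch, assuming the text always has the two-number shape shown in the output above (`parse_pair` is a hypothetical helper name):

```python
import re

# Matches an optionally negative integer or decimal number.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_pair(text):
    """Extract exactly two numbers from strings such as '1.7 kn / 250 °'."""
    values = [float(match) for match in _NUMBER.findall(text)]
    if len(values) != 2:
        raise ValueError("expected two numbers in {!r}".format(text))
    return values[0], values[1]


latitude, longitude = parse_pair("-1.53057° / -48.77838°")
speed_knots, course_degrees = parse_pair("1.7 kn / 250 °")
print(latitude, longitude, speed_knots, course_degrees)
```

Raising on anything other than two numbers makes a layout change on the page fail loudly instead of yielding wrong values.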
cruisepandey