
I'm trying to scrape the International Maritime Organization's data (https://gisis.imo.org/Public/PAR/Search.aspx) on attacks against shipping vessels between 2002-01-01 and 2005-12-31 (the "is between" option in the site's search engine).

Fill in the dates and click add

I've previously used the bs4 and requests modules in Python to scrape financial data from Yahoo and weather data from Wunderground, but this site requires a login and password (under the "public" account type). Furthermore, the data only appears after a search/filter step, so I can't get at the HTML for the results directly:

Once I click on a row here, it expands into a detail view. (Before anyone asks why I don't just download the dataset and pull from there: the download is, for some reason, filtered, and not all the columns are included, for example the IMO number.)


Ultimately, the data I'm trying to pull is from this detail page, and I need the following (item, CSS path):

  • position of incident

    #ctl00_bodyPlaceHolder_ctl00_pnlDetail > table:nth-child(4) > tbody > tr:nth-child(1) > td:nth-child(2) > span
    
  • date

    #ctl00_bodyPlaceHolder_ctl00_pnlDetail > table:nth-child(4) > tbody > tr:nth-child(6) > td.content > span
    
  • ship name

    #ctl00_bodyPlaceHolder_ctl00_pnlDetail > table:nth-child(4) > tbody > tr:nth-child(4) > td:nth-child(2) > span
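
One thing to watch with the paths above: they were copied from the browser, and browsers add a `tbody` element to tables even when the raw HTML has none, so a selector containing `tbody` may fail against the HTML that requests downloads. Below is a minimal sketch that produces both variants of each path so either can be tried; `DETAIL_PATHS` and `candidate_selectors` are names made up for illustration, and whether this site's raw markup contains `tbody` is an open assumption.

```python
# CSS paths copied verbatim from the list above (as shown by browser dev tools).
PANEL = "#ctl00_bodyPlaceHolder_ctl00_pnlDetail > table:nth-child(4) > tbody"
DETAIL_PATHS = {
    "position": PANEL + " > tr:nth-child(1) > td:nth-child(2) > span",
    "date": PANEL + " > tr:nth-child(6) > td.content > span",
    "ship_name": PANEL + " > tr:nth-child(4) > td:nth-child(2) > span",
}


def candidate_selectors(css_path):
    """Return the path as given, plus a variant with 'tbody' steps removed.

    The second variant is only included when it differs from the first.
    """
    steps = [step.strip() for step in css_path.split(">")]
    normalized = " > ".join(steps)
    without_tbody = " > ".join(step for step in steps if step != "tbody")
    if without_tbody == normalized:
        return [normalized]
    return [normalized, without_tbody]


if __name__ == "__main__":
    for item, path in DETAIL_PATHS.items():
        print(item, candidate_selectors(path))
```

With bs4, each candidate can be passed to `soup.select_one(...)` in turn until one returns an element, and the element's text is then the value wanted.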
    

Needless to say, this seems like a daunting task. Any recommendations?

Here is the OLD code I've been using to scrape the weather data (I haven't changed anything yet because I don't know where to start with the login/filter process): http://pythonfiddle.com/get-wx-data

d8aninja
  • @salmanwahed edited and added the pyfiddle. I haven't changed it from my previous scraper yet because I honestly don't even know where to start. I can point the scraper at the css elements, but how do I get the site to go through the search/filter process? – d8aninja Oct 10 '14 at 18:25
  • Hi d8aninja, I am trying to do the same, did you manage to scrape this website? – vivirbr May 01 '20 at 06:36

1 Answer


requests alone isn't going to be enough. You'll want to look into mechanize: http://wwwsearch.sourceforge.net/mechanize/

The nice thing about mechanize is that it maintains state (cookies, for instance) from page to page, which plain requests.get calls don't do unless you go through a requests.Session. (You probably could do it with just requests, but I'm not quite that clever.) Here's an example of a simple login interaction.
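
To make "maintains state" concrete, here is a minimal sketch of a stateful login that uses the standard library's cookie jar rather than mechanize itself. The helper names are made up for illustration, and the login URL and form field names are placeholders that would have to be read off the real login form.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def make_session():
    """Return (opener, jar); the opener stores and re-sends cookies via jar."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def build_login_request(login_url, fields):
    """Build a POST request carrying the URL-encoded login form fields."""
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(login_url, data=body, method="POST")


def login(opener, login_url, fields):
    """Submit the login form; cookies set by the server land in the jar."""
    with opener.open(build_login_request(login_url, fields)) as response:
        return response.read().decode("utf-8", errors="replace")


# Hypothetical usage (placeholder URL and field names, not the real ones):
# opener, jar = make_session()
# login(opener, "https://example.invalid/Login.aspx",
#       {"username": "me", "password": "secret"})
# Later opener.open(...) calls reuse the session cookies automatically.
```

Every later page has to be fetched through the same opener; a fresh opener starts with an empty jar and is logged out again.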

This would be awesome, if the IMO site were that easy. Instead, it's ASP.NET-based (note the .aspx URL), and that means it's relatively irritating to scrape: searches and row clicks are typically form postbacks that must carry hidden state fields such as __VIEWSTATE back to the server. Some of the details will vary from site to site, so I'll suggest two things in particular: looking at the Network tab of your browser's developer tools and reading this ScraperWiki post on dealing with ASP sites.
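
As a rough illustration of what such a postback involves, the sketch below collects a page's hidden inputs and merges them with the values you want to submit. It assumes the usual ASP.NET WebForms fields (`__VIEWSTATE`, `__EVENTVALIDATION`, `__EVENTTARGET`, `__EVENTARGUMENT`); the `ctl00$example$...` control names and the sample HTML are invented, so the real names have to come from the Network tab.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if is_hidden and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    """Return a dict of every hidden input found in the HTML text."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields


def build_postback(html, event_target, form_values):
    """Combine the page's hidden state with the fields being submitted."""
    payload = hidden_fields(html)
    payload["__EVENTTARGET"] = event_target  # control that "was clicked"
    payload["__EVENTARGUMENT"] = ""
    payload.update(form_values)
    return payload


# Invented sample page; a real one would come from the previous response.
SAMPLE = """
<form method="post" action="Search.aspx">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="ctl00$example$txtDate" value="" />
</form>
"""

if __name__ == "__main__":
    print(build_postback(SAMPLE, "ctl00$example$btnAdd",
                         {"ctl00$example$txtDate": "2002-01-01"}))
```

The resulting dict is what gets URL-encoded and POSTed back to the same page. The hidden values change with every response, so they need to be re-extracted before each new postback.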

Best of luck!

myersjustinc