BeautifulSoup-ing a website with login and site search engine

Question

I'm trying to scrape the International Maritime Organization's data (https://gisis.imo.org/Public/PAR/Search.aspx) on shipping vessel attacks between 2002-01-01 and 2005-12-31 (using the "is between" option in the site's search engine).

Fill in the dates and click add

I've previously used the bs4 and requests modules in Python to scrape financial data from Yahoo and weather data from Wunderground, but this site requires a login and password (under the "public" account type). Furthermore, the data requires a search/filter step before I can access the HTML on the page.

Once I click on a row here, it expands to the image below. (Before anyone asks why I don't just download the dataset and pull from there: the download is for some reason filtered, and not all the columns are included, for example the IMO number.)

[Screenshot: the expanded row with the incident details]

Ultimately, the data I am trying to pull is from this page, and I need (item, CSS path):

position of incident

#ctl00_bodyPlaceHolder_ctl00_pnlDetail > table:nth-child(4) > tbody > tr:nth-child(1) > td:nth-child(2) > span

date

#ctl00_bodyPlaceHolder_ctl00_pnlDetail > table:nth-child(4) > tbody > tr:nth-child(6) > td.content > span

ship name

#ctl00_bodyPlaceHolder_ctl00_pnlDetail > table:nth-child(4) > tbody > tr:nth-child(4) > td:nth-child(2) > span

Needless to say this seems like a daunting task. Any recommendations?

Here is the old code I've been using to scrape the weather data (I haven't changed anything yet because I don't know where to start with the login/filter process): http://pythonfiddle.com/get-wx-data

@salmanwahed edited and added the pyfiddle. I haven't changed it from my previous scraper yet because I honestly don't even know where to start. I can point the scraper at the css elements, but how do I get the site to go through the search/filter process? — d8aninja, Oct 10 '14 at 18:25
Hi d8aninja, I am trying to do the same, did you manage to scrape this website? — vivirbr, May 01 '20 at 06:36

score 1 · Accepted Answer · answered Oct 10 '14 at 21:13

requests alone isn't going to be enough. You'll want to look into mechanize: http://wwwsearch.sourceforge.net/mechanize/

The nice thing about mechanize is that it maintains state from page to page, unlike requests. (You probably could do it with just requests, but I'm not quite that clever.) Here's an example of a simple login interaction.
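As a rough sketch of what such an interaction can look like (this is not the linked example, and it uses requests rather than mechanize): a `requests.Session` object also carries cookies from one request to the next, so a plain form login might be written as below. The URLs and the form field names `username` and `password` are placeholders, not taken from any real site; the helper names `login` and `fetch_after_login` are made up for illustration.

```python
def login(session, login_url, username, password):
    """Submit a login form; the session keeps any cookies the server sets."""
    # "username" and "password" are placeholder field names (an assumption).
    # Check the real form's <input name="..."> attributes in the browser's
    # developer tools and substitute them here.
    form_data = {"username": username, "password": password}
    response = session.post(login_url, data=form_data)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response


def fetch_after_login(session, page_url):
    """Fetch a page with the same session, so the login cookies are sent."""
    response = session.get(page_url)
    response.raise_for_status()
    return response.text


# Usage sketch (not executed here; needs the third-party requests package
# and real URLs in place of the example.com placeholders):
#
#   import requests
#   with requests.Session() as session:
#       login(session, "https://example.com/login", "me", "secret")
#       html = fetch_after_login(session, "https://example.com/members")
```

Passing the session in as a parameter keeps the two helpers independent of how the session was created, which also makes them easy to exercise with a stand-in object.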

This would be awesome if the IMO site were that easy. Instead, it's ASP.NET-based (note the .aspx URL), and that means it's relatively irritating to scrape: such pages typically drive searches through form postbacks that carry hidden state fields such as __VIEWSTATE. Some of the details will vary from site to site, so I'll suggest two things in particular: looking at the Network tab of your browser's developer tools and reading this ScraperWiki post on dealing with ASP sites.
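To make the postback problem a bit more concrete: ASP.NET Web Forms pages usually embed hidden inputs such as `__VIEWSTATE` and `__EVENTVALIDATION`, and the server expects them to be sent back with each form submission, along with `__EVENTTARGET` naming the control that was "clicked". The sketch below, using only the standard library, collects those hidden fields from a page and builds a POST payload. The control names in the usage comment are hypothetical; the real ones have to be read off the Network tab for the site in question.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> tags."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        name = attributes.get("name")
        if is_hidden and name:
            # A missing value attribute is treated as an empty string.
            self.fields[name] = attributes.get("value") or ""


def hidden_fields(html):
    """Return a dict of all hidden form fields found in the HTML text."""
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.fields


def build_postback(html, event_target, form_values):
    """Build POST data: echoed hidden state plus the fields being submitted."""
    payload = hidden_fields(html)
    payload["__EVENTTARGET"] = event_target   # control that triggers the postback
    payload.setdefault("__EVENTARGUMENT", "")
    payload.update(form_values)               # e.g. the search filter values
    return payload


# Usage sketch (not executed here; control names are hypothetical):
#
#   page = session.get(search_url)
#   data = build_postback(page.text, "ctl00$someButton",
#                         {"ctl00$someDateBox": "2002-01-01"})
#   results = session.post(search_url, data=data)
```

Each response generally contains fresh hidden values, so the payload would need to be rebuilt from the most recent page before every subsequent postback.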

Best of luck!

much thanks. I'm still working on editing the script I have been using (OP) but this is good info. — d8aninja, Oct 11 '14 at 19:33
