
OK, so I just want to know what my best plan of action is here and what tools/frameworks I would need:

1. Log onto webpage

2. Navigate to desired page which would require clicking on buttons and then filling in text boxes for searching

Steps 3 and 4 repeat in a loop:

3. Grab the html from the page and store in local txt file

4. Analyze the text file and if string matches certain criteria, notify me via email that a match has been found

My thought process was to use Scrapy to get the data, but I wasn't sure how to navigate the page and provide input (such as login credentials and button navigation), which made me want to use Selenium (I use it at work, so I'm fairly comfortable with it), but I'm not sure if that's the best way.

Thanks for any guidance!

Tyler Kelly

2 Answers


A lot of the time, "Clicking on Buttons" and "Filling In Forms" don't require you to actually do any of those things. They're just the mechanism the browser uses to collect data from you, which it then submits to the server via a POST request. You can make those POST requests directly.

With JavaScript, the same thing is happening; the page just submits the POST without reloading and then modifies the current page with the new data.

For a majority of the cases, you can just figure out where the POST is being made to, and what fields you need to fill in, and then do it yourself. Some good starting points would be Using FormRequest.from_response() to simulate a user login, and this SO Scrapy/Ajax question.
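
A minimal sketch of that approach, assuming a standard username/password form (the URLs, field names, and target string below are placeholders for whatever the real site uses):

```python
import scrapy

class LoginSpider(scrapy.Spider):
    name = "login_example"
    # Placeholder URL -- replace with the site's actual login page.
    start_urls = ["https://example.com/login"]

    def parse(self, response):
        # from_response() pre-fills hidden inputs (CSRF tokens, etc.) from the
        # form on the login page and merges in the credentials you supply.
        return scrapy.FormRequest.from_response(
            response,
            formdata={"username": "your_user", "password": "your_pass"},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Now that the session cookie is set, request the page you actually
        # want and check it for the string you care about.
        yield scrapy.Request(
            "https://example.com/search?q=term",
            callback=self.check_page,
        )

    def check_page(self, response):
        if "desired string" in response.text:
            self.logger.info("Match found on %s", response.url)
```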

This will allow you to simplify and stick with just Scrapy, instead of fetching entire page contents with Selenium, and passing data to Scrapy in files, all of which would be significantly slower.

As an aside, if you do want to go with Selenium and want something to parse the data afterwards, don't go with Scrapy. It's a full-fledged framework and a poor fit for just parsing HTML. Instead, use its parsing library, parsel, which eLRuLL mentioned, or use BeautifulSoup4 (the documentation and homepage are here).
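
For instance, a rough sketch of handing Selenium's rendered HTML to BeautifulSoup (the driver choice, URL, and selectors are placeholders):

```python
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()  # or whichever driver you already use at work
driver.get("https://example.com/search")
# ... click buttons / fill in the search boxes here ...

# page_source is the HTML after any JavaScript has run.
soup = BeautifulSoup(driver.page_source, "html.parser")

# Pull out just the parts you need and test them against your criteria.
if any("desired string" in row.get_text() for row in soup.select("div.result")):
    print("Match found")

driver.quit()
```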

Rejected

Sure, Selenium is, I think, one of the best choices for this scenario. You can also try to replicate the login requests with Scrapy, but you'll need to know which requests, headers, and cookies are necessary for a correct crawl with Scrapy (because Scrapy doesn't provide browser automation like Selenium does).

For parsing the body, Scrapy is of course the best choice, but you could also just use parsel if you only need its Selectors.
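
For example (a sketch only; the file path and selectors are placeholders), parsel can be used on its own like this:

```python
from parsel import Selector

# The HTML saved in step 3 of the question (or Selenium's driver.page_source).
with open("page.html", encoding="utf-8") as f:
    sel = Selector(text=f.read())

# CSS (or XPath) selectors pull out just the fields you want to check.
titles = sel.css("div.result h2::text").getall()
if any("desired string" in title for title in titles):
    print("Match found")
```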

For sending an email you should configure an SMTP client; this article explains it better.
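
As a rough idea of what that looks like with Python's standard library (the server, credentials, and addresses below are placeholders):

```python
import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Match found"
msg["From"] = "me@example.com"
msg["To"] = "me@example.com"
msg.set_content("The page contained the string you were watching for.")

# SMTP_SSL connects over TLS; many providers also require an app-specific password.
with smtplib.SMTP_SSL("smtp.example.com", 465) as server:
    server.login("me@example.com", "app_password")
    server.send_message(msg)
```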

eLRuLL
  • Thanks, I guess I'll go the Selenium route and then just download the HTML file and parse over it for the data I want using some Java or Python. I've only worked with the Selenium version for Java, but I wanted to do this project in Python; big learning curve to switch? – Tyler Kelly Nov 11 '15 at 04:49
  • 1
    I disagree wholeheartedly that "`scrapy` is the best choice" for parsing the HTML. You should go with a parsing library, not a complete crawling framework. – Rejected Nov 11 '15 at 05:14
  • yeah, that's why I recommend `parsel` or any other parsing library. @user3470987 this [link](http://stackoverflow.com/questions/17975471/selenium-with-scrapy-for-dynamic-page) could give you a simple example of using Selenium with Scrapy – eLRuLL Nov 11 '15 at 11:14