A contractor has provided us with survey data for a set of stores. The data contains the store numbers, thumbnail images and large images. The data is accessed through the contractor's secured website. In order to build a report for the data, I am trying to scrape the store numbers and images from the site instead of manually downloading each image.

I have not used CFHTTP with secured sites before, but I have had a little success so far with:

<cfhttp
    method="post"
    url="http://www.website.com/impart/client_login.php"
    throwonerror="yes"
    redirect="yes"
    resolveurl="yes">

    <cfhttpparam name="user" value="myUsername" type="formfield">
    <cfhttpparam name="pass" value="myPassword" type="formfield">
    <cfhttpparam name="submit" value="Login" type="formfield">
</cfhttp>

How do I proceed from getting past the authentication to the page that contains the image to download?

aparker81
  • You'll need to know more about (and then relay here) your third-party site's authentication before a complete answer can be provided. You may luck out and be able to monitor the creation of one or more cookies upon successfully logging in to their site by hand--and if so--use the names (and values) of those cookies for subsequent cfhttp calls to secure pages. You'll need to know definitely, first...otherwise, answers will be based off of pure speculation. – Shawn Holmes Jan 05 '12 at 21:07

2 Answers

I think CFHTTP may not be the best choice for this. I am comfortable with Bash, so I would lean towards scripting it with curl, but one of the tools listed on this page might be easier: http://www.timedicer.co.uk/web-scraping

speeves

What does a dump of the cfhttp scope look like? Specifically, what is the status code?

If you get a status code of 200, you'll need to maintain the session as you grab each image. See the following:

http://www.bennadel.com/blog/725-Maintaining-Sessions-Across-Multiple-ColdFusion-CFHttp-Requests.htm

http://www.bennadel.com/projects/cfhttp-session.htm
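
In case those links move, here is a rough sketch of the general idea (my own summary, not code copied from them). It assumes the site tracks the login with cookies, which has not been confirmed for this contractor's site. The page name `some_protected_page.php` and the variable names are made up for illustration. I set redirect to "no" on the login request because, as far as I know, when cfhttp follows the redirect itself you only see the headers of the final response and can miss cookies set by the login response.

```cfm
<!--- Log in without following redirects so the Set-Cookie headers stay visible. --->
<cfhttp
    method="post"
    url="http://www.website.com/impart/client_login.php"
    redirect="no"
    result="loginResponse">
    <cfhttpparam type="formfield" name="user" value="myUsername">
    <cfhttpparam type="formfield" name="pass" value="myPassword">
    <cfhttpparam type="formfield" name="submit" value="Login">
</cfhttp>

<!--- Collect name/value pairs from Set-Cookie. Depending on the engine and
      the number of cookies, the header can be a string, a struct, or an array. --->
<cfset sessionCookies = {}>
<cfif structKeyExists(loginResponse.responseHeader, "Set-Cookie")>
    <cfset rawCookies = loginResponse.responseHeader["Set-Cookie"]>
    <cfif isSimpleValue(rawCookies)>
        <cfset rawCookies = [rawCookies]>
    <cfelseif NOT isArray(rawCookies)>
        <!--- Struct case: copy the values into an array. --->
        <cfset cookieList = []>
        <cfloop collection="#rawCookies#" item="cookieKey">
            <cfset arrayAppend(cookieList, rawCookies[cookieKey])>
        </cfloop>
        <cfset rawCookies = cookieList>
    </cfif>
    <cfloop array="#rawCookies#" index="rawCookie">
        <!--- Keep only "name=value"; drop path, expires and other attributes. --->
        <cfset cookiePair = listFirst(rawCookie, ";")>
        <cfset sessionCookies[trim(listFirst(cookiePair, "="))] = listRest(cookiePair, "=")>
    </cfloop>
</cfif>

<!--- Send the captured cookies back when requesting a protected page (hypothetical URL). --->
<cfhttp
    method="get"
    url="http://www.website.com/impart/some_protected_page.php"
    result="pageResponse">
    <cfloop collection="#sessionCookies#" item="cookieName">
        <cfhttpparam type="cookie" name="#cookieName#" value="#sessionCookies[cookieName]#">
    </cfloop>
</cfhttp>
```

Dump `loginResponse` and `pageResponse` to check that a cookie actually comes back and that the second request returns the protected page rather than the login form. If the site authenticates some other way (HTTP basic auth, a token in the URL), this approach will not apply as written.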

See this question for saving images via CFHTTP:

Convert an image from CFHTTP filecontent to binary data with Coldfusion
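
A minimal sketch of the image download itself, assuming you already have a struct named `sessionCookies` holding the cookie names and values captured from the login response (the struct name, the image URL, and the local `images` folder are all hypothetical):

```cfm
<!--- Assumed to exist already; cfparam only creates an empty struct if it does not. --->
<cfparam name="sessionCookies" default="#structNew()#">

<!--- Hypothetical image URL; the real one would come from the scraped page. --->
<cfset imageUrl = "http://www.website.com/impart/images/store_1234_large.jpg">

<!--- getasbinary="yes" keeps fileContent as binary data instead of converting it to text. --->
<cfhttp method="get" url="#imageUrl#" getasbinary="yes" result="imageResponse">
    <cfloop collection="#sessionCookies#" item="cookieName">
        <cfhttpparam type="cookie" name="#cookieName#" value="#sessionCookies[cookieName]#">
    </cfloop>
</cfhttp>

<!--- statusCode looks like "200 OK", so compare only the first three characters. --->
<cfif left(imageResponse.statusCode, 3) eq "200">
    <!--- The target directory must already exist. --->
    <cffile
        action="write"
        file="#expandPath('./images/store_1234_large.jpg')#"
        output="#imageResponse.fileContent#">
</cfif>
```

Looping this over the store numbers and image URLs scraped from the listing page would give you one file per image. It is worth checking the response's content type as well, since an expired session may return a login page with a 200 status.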

Billy Cravens
  • Can't believe I overlooked Ben's post. Thank you for the direction. It has provided me the most progress yet. – aparker81 Jan 05 '12 at 21:30
  • Take away the links and there's no answer here. Ben's a good guy and provides some great information, but what's to stop him from changing his permalinks or shutting down his blog? Please remember that this information is not just for the original asker, but also for future readers with the same problem. – ale Jan 06 '12 at 03:15
  • @Al To a certain degree I agree, but I think links to definitive resources is part of the SO aesthetic, characterized in many answers I've seen. Moreover, Ben's blog has become a de facto canonical resource; I feel as good linking to it as I would to say the Railo or ColdBox wiki. Moreover, I don't wish to plagiarize or claim credit. I could summarize and attribute of course, but I fear a loss of fidelity in translation. I believe SO is as valuable as a curated "resource of resources" as it is a repository of content and code. – Billy Cravens Jan 09 '12 at 07:31
  • You might want to read the many conversations on MSO about this. It's _not_ part of the SO aesthetic; users shouldn't have to leave the site to get the information they need. – ale Jan 09 '12 at 13:35
  • Thanks for the feedback. I'll try to add some explanation/summary to links the next time, but I'll still leave links in, to remain intellectually honest. – Billy Cravens Jan 12 '12 at 06:37
  • @Billy: yes, absolutely leave links to the source of the information (or where more details can be had). See also: http://meta.stackexchange.com/questions/8231/are-answers-that-just-contain-links-elsewhere-really-good-answers – ale Jan 12 '12 at 17:38