
I am interested in building a web crawler for classifieds. The problem with crawled classifieds is that the items are constantly expiring. When a user searches on my site, is there a way to check "on the fly" if the listings are expired?

Basically, if my page displays 20 records, how can I check whether each one has expired? Is there a way to check "on the fly" and hide an expired record rather than displaying it to the user? Perhaps a .js script like checkDeletedRecords()?

http://carsforsale.com/used_cars_for_sale/2004_Honda_Civic_136820531

phpboy

2 Answers


You could write something that would periodically check the listing (via a cron job) and see if it's expired.

If the pages you are crawling have some kind of indicator of when a listing expires ("Listing expires at July 8th 2011"), your crawler could parse that date and store it in your DB. Then it's just a matter of filtering out the expired listings on your end. Most classified sites put some time limit on their listings (either indicated on the listing itself or as a site policy), so this approach is your best bet.
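
For illustration, here's a rough sketch of that parsing-and-filtering step (Python; the exact phrase, regex, date format, and table layout are assumptions about the target site, not anything the question's site actually uses):

```python
import re
from datetime import datetime

# Assumed on-page wording: "Listing expires at July 8th 2011".
EXPIRY_RE = re.compile(
    r"Listing expires at (\w+) (\d{1,2})(?:st|nd|rd|th)? (\d{4})"
)

def parse_expiry(page_text):
    """Return the expiry date parsed from a listing page, or None."""
    match = EXPIRY_RE.search(page_text)
    if not match:
        return None
    month, day, year = match.groups()
    return datetime.strptime(f"{month} {day} {year}", "%B %d %Y")

# At search time, filter on the stored date instead of re-fetching, e.g.:
#   SELECT * FROM listings WHERE expires_at IS NULL OR expires_at > NOW();
```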

EDIT: And as always when you are crawling, respect the site's robots.txt

Jay Sidri
  • The problem with this is that there is still a time gap between the cron job's last run and now. Is there a way to check the results on the fly? – phpboy Jul 04 '11 at 02:26
  • The only way you could do this on the fly is if you could figure out when the listing is most likely to expire and store that on your end, like I described. Doing it in real time is not feasible, and not advisable at all – Jay Sidri Jul 04 '11 at 02:30
  • I've seen classified aggregator sites that collapse duplicate or expired records on the fly, and it's pretty responsive... – phpboy Jul 04 '11 at 02:41
  • They are probably doing the same thing I described. If you are still thinking of doing it in real time by hitting their servers, think about the overhead it would incur on your end, not to mention pissing off the webmaster of the classified site you would be constantly hitting each time someone comes to your site (do it long enough and your IP would most likely be blacklisted anyway) – Jay Sidri Jul 04 '11 at 02:55

I have done something like this before. My solution was to add a LastFound property to the listings. Every time you crawl the site and find the same listing, update its LastFound timestamp.

If you then crawl the site every day, you can assume that all listings not found in the last day are expired. Obviously, if you crawl the site at shorter intervals, your data can be more up-to-date.
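
A minimal sketch of that bookkeeping (Python with SQLite; the table and column names are placeholders, not anything from the question):

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect("listings.db")
# Hypothetical schema, created here so the sketch is self-contained.
conn.execute(
    "CREATE TABLE IF NOT EXISTS listings "
    "(url TEXT PRIMARY KEY, title TEXT, last_found TEXT)"
)

def mark_found(url):
    # Called by the crawler each time it re-discovers a stored listing.
    conn.execute(
        "UPDATE listings SET last_found = ? WHERE url = ?",
        (datetime.utcnow().isoformat(), url),
    )
    conn.commit()

def live_listings(max_age_hours=24):
    # Anything not seen within one crawl interval is assumed expired.
    # ISO-8601 strings compare correctly as text, so this works in SQLite.
    cutoff = (datetime.utcnow() - timedelta(hours=max_age_hours)).isoformat()
    return conn.execute(
        "SELECT url, title FROM listings WHERE last_found >= ?",
        (cutoff,),
    ).fetchall()
```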

This may not satisfy your "on the fly" requirement, but one solution would be to check whether the original page still exists each time you want to display it. This would be horribly inefficient though, and I wouldn't recommend it.
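
For what it's worth, that per-search check would look something like this (Python with the requests library; note too that many sites serve a normal 200 page for expired listings, so even this is unreliable):

```python
import requests  # third-party HTTP client: pip install requests

def still_exists(url, timeout=3):
    # One HEAD request per displayed record: no body is downloaded,
    # but it is still a full round trip to the other site per listing.
    try:
        resp = requests.head(url, allow_redirects=True, timeout=timeout)
        return resp.status_code == 200
    except requests.RequestException:
        return False
```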

Beno
  • So if I display 20 results... it would be ridiculous to check all 20 URLs to see if they're valid. – phpboy Jul 04 '11 at 02:30
  • That's what I'm saying, yes. That is 20 extra requests your site would have to make each time there is a search. I think this is a hard problem if you want up-to-the-second details, but as Jahufar points out, if there is a listing end date, you could use that. If not, you may have to settle for listings that may be out of date by a few hours, depending on your crawl rate – Beno Jul 04 '11 at 02:33
  • Can I just crawl the "header"? – phpboy Jul 06 '11 at 19:39
  • See the accepted answer to this question: http://stackoverflow.com/questions/122853/c-get-http-file-size. It looks like it is possible if the server allows the operation. However, you are still going to be sending many requests to that server. As Jahufar said, some webmasters don't take kindly to excessive requests and could blacklist you... then you're completely screwed (maybe not completely, but it would be a big setback) – Beno Jul 06 '11 at 22:52