I am trying to find the last chapter number of a story at www.fanfiction.net, just for fun. Since the chapter URLs follow a fixed pattern, I thought I would simply increment the chapter number until I reach a URL that does not exist.

To check whether a URL exists, I tried the script from this Stack Overflow question.

However, I found that the site does not return an error status (400 or above) for a missing chapter; it returns a 200 response with an error message in the page instead. What would be the best way to tell whether the page exists?

Here is a link that actually exists, and here is one that does not exist.

How can I do so?

EDIT 1

Thanks to GregSchoen I worked it out. I hope it is correct though :)

I checked the value of resp.getheader("last-modified", None): it gives a date for active links and None for those that are not.
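
For anyone wanting to try the same thing, here is a minimal sketch of that check combined with the incrementing loop from the question. It assumes Python 3's `http.client`; the function names `has_last_modified` and `last_chapter`, the `limit` cap, and the URL template in the comment are all made up for illustration, and it relies on the site continuing to omit Last-Modified on missing chapters, which is only an observation, not something the server guarantees.

```python
import http.client
from urllib.parse import urlsplit


def has_last_modified(url, timeout=10):
    """Send a HEAD request and report whether Last-Modified is present.

    The URL must include its scheme (http:// or https://).
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        resp = conn.getresponse()
        # Same check as in the edit above: None means the header is absent.
        return resp.getheader("last-modified", None) is not None
    finally:
        conn.close()


def last_chapter(exists, limit=1000):
    """Count upward from chapter 1 until exists(n) returns False.

    `exists` is any callable taking a chapter number; `limit` is a
    safety cap (an assumption) so the loop cannot run forever.
    """
    chapter = 0
    while chapter < limit and exists(chapter + 1):
        chapter += 1
    return chapter


# Hypothetical usage; the template is a placeholder, not the real pattern:
# template = "https://example.invalid/story/{}/"
# print(last_chapter(lambda n: has_last_modified(template.format(n))))
```

Keeping the loop separate from the network call means the counting logic can be tried with a fake predicate before pointing it at the real site.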

Thanks a lot

  • Or you could fetch the first chapter, look for a `select` tag with `name="chapter"`, and read the `value` of each `option` element it contains to get a list of chapters. You could use `BeautifulSoup` to parse the HTML. – Dietrich Epp Jul 11 '11 at 00:40
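
A rough sketch of the approach from that comment, in case it helps. It uses the standard library's `html.parser` rather than `BeautifulSoup`, purely to avoid an extra dependency; the class and function names are invented for this example, and the sample HTML is a made-up fragment, not copied from the site.

```python
from html.parser import HTMLParser


class ChapterSelectParser(HTMLParser):
    """Collect option values found inside <select name="chapter">."""

    def __init__(self):
        super().__init__()
        self.in_select = False
        self.chapters = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select" and attrs.get("name") == "chapter":
            self.in_select = True
        elif tag == "option" and self.in_select:
            value = attrs.get("value")
            # Skip duplicates in case the page repeats the same select.
            if value is not None and value not in self.chapters:
                self.chapters.append(value)

    def handle_endtag(self, tag):
        if tag == "select":
            self.in_select = False


def chapter_values(html):
    """Return the option values of the chapter select, in page order."""
    parser = ChapterSelectParser()
    parser.feed(html)
    return parser.chapters


# Made-up fragment standing in for a downloaded first chapter.
sample = (
    '<select name="chapter">'
    '<option value="1" selected>1. One<option value="2">2. Two'
    '</select>'
)
print(chapter_values(sample))
```

If the values are chapter numbers, the last entry of the returned list would be the last chapter, so a single download replaces the whole incrementing loop.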

3 Answers

0

Perhaps use cURL, read 100 bytes and just look for "FanFiction.Net Message Type 1" at the start of the data?
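
The same idea can be sketched in Python instead of cURL, using `urllib.request` from the standard library. The function names and the 100-byte default are illustrative, and whether the marker really falls inside the first 100 bytes of the error page is the guess from the answer above, not something verified here.

```python
from urllib.request import urlopen

# Marker text quoted in the answer above.
ERROR_MARKER = b"FanFiction.Net Message Type 1"


def looks_like_error(prefix, marker=ERROR_MARKER):
    """Return True if the marker appears in the bytes read so far."""
    return marker in prefix


def is_error_page(url, nbytes=100, timeout=10):
    """Read only the first nbytes of the body and look for the marker."""
    with urlopen(url, timeout=timeout) as resp:
        return looks_like_error(resp.read(nbytes))
```

This still issues a GET request, but it stops reading after `nbytes`, so it avoids processing the whole page for every chapter.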

Scott C Wilson
0

That website isn't giving a 404 error, which renders all of those scripts useless. You will need to download the whole webpage and check whether it looks like a 404 page.

I think just running:

if (page.find('<style>') == 0):

does the trick, as the error page begins with a <style> tag (a normal page shouldn't).

Blender
  • Could it be done by any method other than downloading the whole page? I was thinking of incrementing the page number and checking whether each page exists. –  Jul 10 '11 at 23:31
  • Not really, as the `404` message (not found) isn't given. Instead, the `200` message (success) is given. You have to download the page and check whether it's the error page or not... – Blender Jul 10 '11 at 23:35
0

If you do a HEAD request on the URLs you supplied, Last-Modified is set on valid pages but not on invalid pages. This would be an easy way to key on valid pages, since their server is not responding with a proper HTTP code.

GregSchoen
  • Hey, could you explain some more? How can I check for the Last-Modified field in the header? –  Jul 10 '11 at 23:39