
I have a bash script that checks the HTTP status code of a list of URLs, but I've noticed that some of them, while reporting "200", actually display a page containing "error 404". How could I check for that?

Here's my current script:

#!/bin/bash
while read LINE; do
  curl -o /dev/null --silent --head --write-out '%{http_code}\n' "$LINE"
done < url-list.txt

(I got it from a previous question: script to get the HTTP status code of a list of urls?)

EDIT There seems to be a bug in the script: it returns "200", but if I wget -o log that same address I get "404 not found".
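
One thing worth checking, since curl --head sends a HEAD request while wget sends a GET: some servers answer the two differently, which could explain the mismatch. A quick diagnostic sketch, reusing the url-list.txt file from above:

#!/bin/bash
# print the status code curl gets for a HEAD request and for a plain GET;
# a difference between the two would explain the curl/wget mismatch
while read -r LINE; do
  head_code=$(curl -o /dev/null --silent --head --write-out '%{http_code}' "$LINE")
  get_code=$(curl -o /dev/null --silent --write-out '%{http_code}' "$LINE")
  echo "$LINE HEAD=$head_code GET=$get_code"
done < url-list.txt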

  • The script above should work fine. If a page isn't there and the website doesn't return a status code of 404, then you can't do much about it, or at least you can't rely on this method. – c00kiemon5ter Jun 22 '11 at 12:48

2 Answers


Just for fun - here is a pure bash solution:

dosomething() {
        code="$1"; url="$2"
        case "$code" in
                200) echo "OK for $url";;
                302) echo "redir for $url";;
                404) echo "notfound for $url";;
                *) echo "other $code for $url";;
        esac
}

#MAIN program
while read -r url
do
        # split "http://host/path" into host and path
        uri=($(echo "$url" | sed 's~http://\([^/][^/]*\)\(.*\)~\1 \2~'))
        HOST=${uri[0]:=localhost}
        FILE=${uri[1]:=/}
        # open a TCP connection; the {SOCKET} fd allocation needs bash 4.1+
        exec {SOCKET}<>/dev/tcp/$HOST/80
        # HTTP header lines end with CRLF; Connection: close makes the server
        # hang up after the response instead of leaving the socket open
        printf 'GET %s HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n' "$FILE" "$HOST" >&${SOCKET}
        # keep only the header block and pull out the status line
        res=($(<&${SOCKET} sed '/^.$/,$d' | grep '^HTTP'))
        exec {SOCKET}>&-
        dosomething "${res[1]}" "$url"
done << EOF
http://stackoverflow.com
http://stackoverflow.com/some/bad/url
EOF
  • Strange - you probably need a newer version of bash. I have GNU bash, 4.2.0(1)-release (i386-apple-darwin10.7.0) - and it's working OK – clt60 Jun 23 '11 at 08:22
  • GNU bash, version 4.2.8(1)-release (i686-pc-linux-gnu) :D – Manu Jun 23 '11 at 09:01

Well, you could grok the response body and look for "404", "Error 404", "Not Found", "404 Not Found", etc. printed in plain text, but that is likely to give both false negatives and false positives. Though if the server sends 200 for what's supposed to be a 404, somebody didn't do their job right.
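
A minimal sketch of that body check - the string "Error 404" is an assumption here; use whatever text actually appears on the site's error pages:

#!/bin/bash
# flag URLs whose body contains the error text even though the server said 200
while read -r url; do
  if curl --silent "$url" | grep -q "Error 404"; then
    echo "soft 404: $url"
  else
    echo "OK: $url"
  fi
done < url-list.txt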

  • I don't think I will have lots of false positives; I'm checking URLs from one domain, and all the 404s contain the same text. – Manu Jun 22 '11 at 13:59
  • 1
    Oh. Then just look for that recurring substring in the body of the response. You could do it with a few lines of Perl, or if you're feeling lucky, just grep 404 and check the return value. If you know the 404s are always 100% identical you could check the response-length header (bear in mind the margin of error, you mind want to check crc32 as well). There are lots of ways to do it if the body predictable enough. (I'd go for the perl substring) – sapht Jun 22 '11 at 14:45
  • 1
    well, there are worse things, like return an OK status code, where it's actually a 404, and the 404 message is an image. o.O find a way around that :P – c00kiemon5ter Jun 22 '11 at 19:02
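
A rough sketch of the length/checksum idea from the comment above - the known-bad URL is a placeholder for any address on the site that you know returns the error page:

#!/bin/bash
# fingerprint a known error page once, then flag any URL whose body matches it;
# cksum prints a CRC and a byte count, so this compares both at once
ref=$(curl --silent "http://example.com/known/bad/url" | cksum)
while read -r url; do
  if [ "$(curl --silent "$url" | cksum)" = "$ref" ]; then
    echo "soft 404: $url"
  else
    echo "OK: $url"
  fi
done < url-list.txt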