
I found this bash script, which checks the status of each URL listed in a text file and prints the destination URL when the request is redirected:

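#!/bin/bash
# Read one URL per line from the file given as the first argument
while IFS= read -r url
do
    dt=$(date '+%H:%M:%S')
    # HEAD request only: capture the status code and, for redirects, the target URL
    urlstatus=$(curl -kH 'Cache-Control: no-cache' -o /dev/null --silent --head --write-out '%{http_code} %{redirect_url}' "$url")
    echo "$url $urlstatus $dt" >> urlstatus.txt

done < "$1"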

I'm not very good at bash. For each URL, I'd also like to add the value of its robots meta tag (if it exists).

Sami

2 Answers


Actually, I'd really suggest a DOM parser (e.g. Nokogiri, hxselect, etc.), but you can do it with sed, for instance. The command below handles lines containing a <meta tag and extracts the value of the robots tag's content attribute:

curl -s "$url" | sed -n 's/.*<meta[[:space:]][[:space:]]*name="*robots"*[[:space:]][[:space:]]*content="*\([^">]*\)"*[^>]*>.*/\1/p'

This prints the value of the attribute, or nothing at all if the page has no matching robots meta tag (so a variable assigned from it ends up as the empty string).
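To try the extraction without hitting the network, here is a minimal sketch that wraps a sed expression of this kind in a function and feeds it a made-up page. The function name robots_content and the sample HTML are assumptions for illustration only. The pattern uses unescaped < and >, because GNU sed treats \< and \> as word boundaries rather than literal angle brackets. It is case-sensitive and only matches when name comes before content on the same line.

```bash
#!/bin/bash
# robots_content is a hypothetical helper name, not part of the original answer.
# It reads HTML on stdin and prints the content attribute of a robots meta tag,
# provided name="robots" comes before content on the same line.
robots_content() {
    sed -n 's/.*<meta[[:space:]][[:space:]]*name="*robots"*[[:space:]][[:space:]]*content="*\([^">]*\)"*[^>]*>.*/\1/p'
}

# Made-up sample page, so no network access is needed.
sample='<html><head>
<meta charset="utf-8">
<meta name="robots" content="noindex, nofollow">
</head></html>'

printf '%s\n' "$sample" | robots_content   # prints: noindex, nofollow
```

In the loop, the same function could sit at the end of the pipeline, e.g. curl -s "$url" | robots_content.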

Do you need a pure Bash solution? Or do you have sed?

stephanmg
    Thanks, I've just added a new instruction before echo: metarobotsheader=$(curl -s "$url" | sed -n 's/.*<meta[[:space:]][[:space:]]*name="*robots"*[[:space:]][[:space:]]*content="*\([^">]*\)"*[^>]*>.*/\1/p') – Sami Nov 07 '19 at 13:43

You can add a line that extracts the robots meta tag from the page source, and modify the echo line to include its value:

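#!/bin/bash
# Read one URL per line from the file given as the first argument
while IFS= read -r url
do
    dt=$(date '+%H:%M:%S')
    # HEAD request: status code and redirect target
    urlstatus=$(curl -kH 'Cache-Control: no-cache' -o /dev/null --silent --head --write-out '%{http_code} %{redirect_url}' "$url")
    # Second request fetches the body; keep the first line that looks like a robots meta tag
    metarobotsheader=$(curl -kH 'Cache-Control: no-cache' --silent "$url" | grep -P -i "<meta.+robots" | head -n 1)
    echo "$url $urlstatus $dt $metarobotsheader" >> urlstatus.txt
done < "$1"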

This example records the whole line of HTML that contains the robots meta tag.

If you want to record a "-" when the page has no robots meta tag, replace the metarobotsheader line with this one:

    metarobotsheader=$(curl -kH 'Cache-Control: no-cache' --silent "$url" | grep -P -i "<meta.+robots" || echo "-")

If you want to extract only the value of the content attribute, replace that line with:

    metarobotsheader="$(curl -kH 'Cache-Control: no-cache' --silent "$url" | grep -P -i "<meta.+robots" | perl -e '$line = <STDIN>; if ( $line =~ m#content=[\x27"]?([^\x27">]+)#i) { print "$1"; } else {print "no_meta_robots";}')"

When the page has no robots meta tag, this records no_meta_robots instead.
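The same idea can be sketched without Perl, using only grep and sed. The helper name robots_meta_value is hypothetical and the sample tags are made up. The sketch assumes the whole meta tag sits on one line, and that an unquoted content value is the last attribute in the tag. It accepts either attribute order and any letter case for the tag and attribute names.

```bash
#!/bin/bash
# robots_meta_value is a hypothetical helper, not from the original answers.
# Reads HTML on stdin; prints the content value of the first robots meta tag,
# or the fallback "no_meta_robots" when none is found.
robots_meta_value() {
    local tag value
    # -o prints only the matched tag, -i ignores case, -E enables extended regex.
    tag=$(grep -oiE '<meta[^>]*name=["'\'']?robots["'\'']?[^>]*>' | head -n 1)
    # Pull out the content attribute wherever it sits inside the tag.
    value=$(printf '%s\n' "$tag" |
        sed -n 's/.*[Cc][Oo][Nn][Tt][Ee][Nn][Tt]=["'\'']\{0,1\}\([^"'\''>]*\).*/\1/p')
    printf '%s\n' "${value:-no_meta_robots}"
}

# Made-up inputs for a quick offline check.
printf '%s\n' '<meta name="robots" content="noindex, nofollow" />' | robots_meta_value
printf '%s\n' '<title>no robots tag here</title>' | robots_meta_value
```

Inside the loop it would replace the grep and perl stages: metarobotsheader=$(curl -kH 'Cache-Control: no-cache' --silent "$url" | robots_meta_value).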