Bash shell script to find Robots meta tag value

Question

I found this Bash script that checks the status of URLs listed in a text file and prints the destination URL when a request is redirected:

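#!/bin/bash
# The file with the URLs (one per line) is passed as the first argument.
while IFS= read -r url
do
    dt=$(date '+%H:%M:%S')
    # HEAD request: capture the status code and, if redirected, the target URL
    urlstatus=$(curl -kH 'Cache-Control: no-cache' -o /dev/null --silent --head --write-out '%{http_code} %{redirect_url}' "$url")
    echo "$url $urlstatus $dt" >> urlstatus.txt
done < "$1"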

I'm not very good at Bash. For each URL, I'd also like to add the value of its robots meta tag (if it exists).

I'm not sure if this is a Bash specific topic, but you can use curl and probably a little bit awk. — stephanmg, Nov 06 '19 at 09:48

stephanmg · Accepted Answer · 2019-11-06T10:29:14.623

1

I'd really suggest a DOM parser (e.g. Nokogiri, hxselect, etc.), but you can do this, for instance. It handles lines containing <meta and extracts the value of the content attribute of the robots tag:

curl -s "$url" | sed -n '/<meta/s/.*<meta[[:space:]][[:space:]]*name="*robots"*[[:space:]][[:space:]]*content="*\([^">]*\)"*[^>]*>.*/\1/p'

This prints the value of the attribute, or nothing if the page has no such tag. Note that it expects name to come before content and the whole tag to be on one line.

Do you need a pure Bash solution? Or do you have sed?
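
On the pure-Bash question: a minimal sketch of how the extraction could look with Bash's own regex matching instead of sed. The function name robots_content is hypothetical, and the patterns are assumptions about typical markup (the value is quoted, or unquoted without spaces); a DOM parser remains the more robust choice.

```bash
#!/bin/bash
# Hypothetical helper (not from the original answer): read HTML on stdin and
# print the content attribute of the robots meta tag, or nothing if absent.
# The body runs in a subshell ( ... ) so nocasematch does not leak to the caller.
robots_content() (
    local html tag
    local q="[\"']?"                                     # optional single or double quote
    local tag_re="<meta[^>]*name=${q}robots${q}[^>]*>"   # a whole <meta ... name=robots ...> tag
    local val_re="content=${q}([^\"'>]*)"                # the content value inside that tag

    html=$(cat)                  # slurp the page; [^>]* also matches newlines
    shopt -s nocasematch         # accept <META NAME="ROBOTS" ...> as well

    if [[ $html =~ $tag_re ]]; then
        tag=${BASH_REMATCH[0]}
        # Attribute order does not matter: content is searched within the tag.
        [[ $tag =~ $val_re ]] && printf '%s\n' "${BASH_REMATCH[1]}"
    fi
)
```

Inside the loop from the question it could be called as metarobots=$(curl -ks "$url" | robots_content), which leaves the variable empty when the page has no robots meta tag.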

edited Nov 06 '19 at 10:29

answered Nov 06 '19 at 10:10

stephanmg


Thanks, I've just added a new instruction before echo: ``metarobotsheader=$(curl -s "$url" | sed -n '/\<meta/s/\<meta[[:space:]][[:space:]]*name="*robots"*[[:space:]][[:space:]]*content="*\([^"]*\)"*\>/\1/p')`` – Sami Nov 07 '19 at 13:43

score 0 · Answer 2 · answered Nov 06 '19 at 12:23

You can add a line that extracts the robots meta tag from the page source, and modify the echo line to show its value:

#!/bin/bash
while IFS= read -r url
do
    dt=$(date '+%H:%M:%S')
    # HEAD request for the status code and redirect target
    urlstatus=$(curl -kH 'Cache-Control: no-cache' -o /dev/null --silent --head --write-out '%{http_code} %{redirect_url}' "$url")
    # GET request for the page body; keep the source lines that mention a robots meta tag
    metarobotsheader=$(curl -kH 'Cache-Control: no-cache' --silent "$url" | grep -P -i "<meta.+robots")
    echo "$url $urlstatus $dt $metarobotsheader" >> urlstatus.txt
done < "$1"

This example records the whole source line that contains the robots meta tag.

If you want to record a "-" when the page has no robots meta tag, replace the metarobotsheader line with this one:

    metarobotsheader=$(curl -kH 'Cache-Control: no-cache' --silent "$url" | grep -P -i "<meta.+robots" || echo "-")
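
The fallback works because grep exits with a non-zero status when nothing matches, so the command after || runs only in that case. A minimal offline sketch of the same pattern follows; the helper name robots_line_or_dash is hypothetical, and -m 1 is an added assumption so that at most one line is recorded per URL.

```bash
#!/bin/bash
# Hypothetical helper: read HTML on stdin and print the first line that
# mentions a robots meta tag, or "-" when no line matches.
robots_line_or_dash() {
    # grep returns 1 on "no match", which triggers the echo fallback.
    grep -i -E -m 1 '<meta.+robots' || echo "-"
}
```

In the loop it would replace the grep part of the pipeline, for example metarobotsheader=$(curl -kH 'Cache-Control: no-cache' --silent "$url" | robots_line_or_dash).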

If you want to extract only the value of the content attribute, replace that line with:

    metarobotsheader="$(curl -kH 'Cache-Control: no-cache' --silent "$url" | grep -P -i "<meta.+robots" | perl -e '$line = <STDIN>; if ( $line =~ m#content=[\x27"]?([^\x27">]+)[\x27"]?#i) { print "$1"; } else {print "no_meta_robots";}')"

When the page has no robots meta tag, it prints no_meta_robots.
