
I'm trying to create a more effective "check if URL exists" function, and I'm almost done; the only roadblock is the regex.

So I'm looking for a regex that will match the first character of any output, then print it and exit. For example, the code below gets the source code of a YouTube page, and as soon as the output reaches the title tags it matches them and kills the wget command.

Idea borrowed from here

https://unix.stackexchange.com/questions/103252/how-do-i-get-a-websites-title-using-command-line

Performance/Efficiency

Here, out of laziness, we have perl read the whole content into memory before starting to look for the tag. Given that the title is found in the `<head>` section, which is in the first few bytes of the file, that's not optimal. A better approach, if GNU awk is available on your system, could be:

wget -qO- 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' | \
gawk -v IGNORECASE=1 -v RS='</title' 'RT{gsub(/.*<title[^>]*>/,"");print;exit}' 

That way, awk stops reading after the first `</title`.
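To illustrate what the RS='</title' trick does, here is a minimal local demonstration (the sample HTML is made up; no network involved):

printf '<html><head><TITLE>Example page</TITLE></head><body>...</body></html>' | \
gawk -v IGNORECASE=1 -v RS='</title' 'RT{gsub(/.*<title[^>]*>/,""); print; exit}'
# prints: Example page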

My logic is this: if the URL exists, it will output source code, and I don't want to waste time downloading the entire source; so on the first character of source-code output, print it and exit.

Then I will store the output of wget and gawk:

first_character_of_source_code=$(wget|awk magic)
if [[ $first_character_of_source_code != '' ]]; then
    echo "URL exists!"
else
    echo "URL doesn't exist!"
fi

Also, for my "check if URL exists" function, I've tried this: How do I determine if a web page exists with shell scripting? The curl solution suggested in the answers is mostly OK, but websites like Quora return 403 Forbidden (and yes, I've added a user agent), whereas the wget plus gawk solution returns source code, which is better for determining whether the URL exists.
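For comparison, a minimal sketch of that kind of curl status-code check could look like this (the URL and user agent here are only placeholders, not the exact code from the linked question); it is this kind of check that some sites answer with 403:

# Check the HTTP status code instead of downloading the body
status=$(curl -s -o /dev/null -A "Mozilla/5.0" -w '%{http_code}' "https://www.quora.com/")
if [[ $status == 2* || $status == 3* ]]; then
    echo "URL exists (HTTP $status)"
else
    echo "URL check failed (HTTP $status)"    # e.g. 403 Forbidden
fi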


2 Answers


If you weren't so committed to using awk, you could have done it quickly and easily with grep:

# grep -q exits as soon as it sees a match (here: any line at all),
# so wget is terminated early instead of downloading the whole page
if wget -qO - https://stackoverflow.com/ | grep -q ""
then
  echo "wget returned at least one character."
fi
  • OK, I've tested your code and it seems to work equally as well as mine. I used the `time` command to compare both. BTW, I was committed to using awk because I didn't know that this could be done with any other command. Actually I will accept your answer because I want to give somebody reputation for helping me. – bosa djo Oct 02 '16 at 04:46
  • I'm basically a beginner and I don't know a lot of commands and stuff. Thinking about it now, your solution might be more robust and reliable than mine. – bosa djo Oct 02 '16 at 04:52
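(For reference, the `time` comparison mentioned in the comment could be reproduced roughly like this; the URL is just an example.)

time (wget -qO - https://stackoverflow.com/ | grep -q "")
time (wget -qO - https://stackoverflow.com/ | awk '{print $1;exit}' FS="" >/dev/null)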

I found the solution; thanks to @karakfa for the suggestion.

Match the first character of the output, print it, and exit:

echo "Yes, a down vote, just what I needed" | awk '{print $1;exit}' FS=""
# It will print
Y
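Note that an empty FS (splitting the record into single characters) is a gawk extension rather than guaranteed POSIX awk behaviour; if portability is a concern, the same first-character grab could be approximated with cut (a sketch, not part of the original suggestion):

echo "Yes, a down vote, just what I needed" | cut -c1
# also prints: Y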

Full source code of my script check_URL.sh (working perfectly):

#!/bin/bash

# Variables
URL="$*"
user_agent="Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"

# Main program
# With FS="" awk treats every character as a field, so $1 is the first
# character of the page source; exit stops reading as soon as it is printed
first_character_of_source_code=$(wget -e robots=off --user-agent="$user_agent" -qO- "$URL" | \
awk '{print $1;exit}' FS="")

if [[ $first_character_of_source_code != '' ]]; then
    echo "URL exists!"
    exit 0
else
    echo "URL doesn't exist!"
    exit 1
fi
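Usage would then look something like this (the URL is only an example):

chmod +x check_URL.sh
./check_URL.sh "https://www.youtube.com/watch?v=Dd7dQh8u4Hc"
# URL exists!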
    Note that this won't work the way you plan to use it for binary files starting with NUL bytes (like non-hybrid ISOs). – that other guy Oct 01 '16 at 22:00
  • @thatotherguy check my answer, I posted the full source code of my script. In my script I will be working with a website's source code. Is _binary files starting with NUL bytes (like non-hybrid ISOs)_ relevant to a website's source code? Can you show an example web page that won't work? Anyway, thanks for your input! – bosa djo Oct 02 '16 at 04:38