
I was trying to find a way of using wget to log the list of redirect destination URLs into one file. For example:

www.website.com/1234 now redirects to www.newsite.com/a2as4sdf6nonsense

and

www.website.com/1235 now redirects to www.newsite.com/ab6haq7ah8nonsense

Wget does output the redirect, but doesn't log the new location. I get this in the terminal:

HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://www.newsite.com/a2as4sdf6

...

I would just like to capture that new URL to a file.

I was using something like this:

    for i in `seq 1 9999`; do
        wget http://www.website.com/$i -O output.txt
    done

But this outputs the source code of each webpage to that file. I am trying to retrieve only the redirect info. Also, I would like to append a new line to the same output file each time it retrieves a new URL.

I would like the output to look something like:

    www.website.com/1234 www.newsite.com/a2as4sdf6nonsense
    www.website.com/1235 www.newsite.com/ab6haq7ah8nonsense

...

user1467663
    If you're willing to consider Perl, instead of wget, you could try using the Perl module WWW::Mechanize as described in this solution: http://stackoverflow.com/questions/10922054/perl-wwwmechanize-or-lwp-get-redirect-url – David Jun 19 '12 at 22:58
  • This worked for me. Thanks! The only part I'm getting stuck on now is using the code mentioned and looping within Perl. How do I have this run for: *www.website.com/n* where *n* is a number that counts from say 1 to 100? – user1467663 Jun 20 '12 at 17:36
  • `foreach(1..100) { my $site = "www.website.com/$_"; # do something with $site; }` – David Jun 20 '12 at 17:41

1 Answer


It's not a perfect solution, but it works:

    wget http://tinyurl.com/2tx --server-response -O /dev/null 2>&1 |
        awk '(NR==1){SRC=$3} /^  Location: /{DEST=$2} END{print SRC, DEST}'

wget is not a perfect tool for that; curl would be a bit better.
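For instance, curl can print the redirect target directly through its `--write-out` format variables, so no awk is needed. A sketch, assuming the same `www.website.com/1234` URL pattern from the question:

```shell
# %{redirect_url} expands to the Location header of the last response,
# while -o /dev/null discards the body and -s silences progress output.
url="http://www.website.com/1234"
dest=$(curl -s -o /dev/null -w '%{redirect_url}' "$url")
echo "$url $dest"
```

Note that without `-L`, curl does not follow the redirect, so `%{redirect_url}` holds the first hop rather than the final destination of a redirect chain.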

This is how it works: we fetch the URL but redirect all output (the page content) to /dev/null. We ask for the server's HTTP response headers (to get the Location header) and pass them to awk. Note that there may be several redirections; I assumed you want the last one. Awk takes the URL you asked for from the first line (NR==1) and the destination URL from each Location header. At the end, we print both SRC and DEST as you wanted.
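To cover the whole range and append one "source destination" line per URL, as your question asks, the one-liner can be dropped into your original loop. A sketch, assuming the `www.website.com/$i` pattern from the question:

```shell
# Run the wget/awk pair for each URL in the range and append each
# "source destination" pair as a new line in output.txt.
for i in $(seq 1 9999); do
    wget "http://www.website.com/$i" --server-response -O /dev/null 2>&1 |
        awk '(NR==1){SRC=$3} /^  Location: /{DEST=$2} END{print SRC, DEST}' >> output.txt
done
```

The `>>` redirection appends rather than overwrites, which is why your `-O output.txt` version kept only the last page.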

Michał Šrajer