
How do I make a Bash script that copies all the links from a website (without downloading the site itself)? It only needs to get all the links and save them in a txt file.

I've tried this code:

wget --spider --force-html -r -l1 http://somesite.com | grep 'Saving to:'

Example: there are download links within a website (for example, dlink.com), so I just want to copy all words that contain dlink.com and save them into a txt file.

I've searched around using Google, and I found none of it useful.

Peter Mortensen
juicebyah
  • See: http://stackoverflow.com/questions/21264626/how-to-strip-out-all-of-the-links-of-an-html-file-in-bash-or-grep-or-batch-and-s And: http://stackoverflow.com/questions/1521462/looping-through-the-content-of-a-file-in-bash – jmunsch Dec 15 '14 at 20:27
  • I think the two links are not related to my post, but thank you for the reply – juicebyah Dec 16 '14 at 11:56
  • You can use phantomjs too for this https://gist.github.com/antivanov/3848638 – Purefan Mar 04 '15 at 08:05

1 Answer

Using a proper parser in Perl:

#!/usr/bin/env perl

use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;
use URI::URL;

my $ua = LWP::UserAgent->new;
my ($url, $f, $p, $res);

if(@ARGV) { 
    $url = $ARGV[0]; }
else {
    print "Enter a URL : ";
    $url = <>;
    chomp($url);
}

my @array = ();
sub callback {
   my($tag, %attr) = @_;
   return if $tag ne 'a';  # we only look closer at <a href ...>
   # keep only the href value, and only when it mentions dlink.com
   push(@array, $attr{href}) if defined $attr{href} && $attr{href} =~ /dlink\.com/i;
}

# Make the parser.  Unfortunately, we don’t know the base yet
# (it might be different from $url)
$p = HTML::LinkExtor->new(\&callback);

# Request document and parse it as it arrives
$res = $ua->request(HTTP::Request->new(GET => $url),
                    sub {$p->parse($_[0])});

# Expand all URLs to absolute ones
my $base = $res->base;
@array = map { url($_, $base)->abs } @array;

# Print them out
print join("\n", @array), "\n";
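
Since the question asks for a Bash script that writes to a txt file, here is a rough Bash-only sketch for comparison. It assumes the links appear as double-quoted href attributes and that a regex match is good enough, so it is less robust than the parser above and does not resolve relative URLs. The function name extract_links and the file name links.txt are placeholders, not anything from the question.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read HTML on stdin and print every
# double-quoted href value that contains the given domain.
extract_links() {
    local domain=$1
    # 1. pull out each href="..." attribute (case-insensitive)
    # 2. strip the leading href=" and the trailing quote
    # 3. keep only the values that mention the domain
    grep -oiE 'href="[^"]*"' \
        | sed -E 's/^[^"]*"//; s/"$//' \
        | grep -iF -- "$domain"
}

# Usage (assumes curl is installed; the URL is the one from the question):
#   curl -s http://somesite.com | extract_links dlink.com > links.txt
```

The Perl script's output can be saved the same way by redirecting it, for example with `> links.txt` after the command.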
Peter Mortensen
Gilles Quénot