Spider a website and retrieve all links that contain a keyword

Question

How do I write a Bash script that collects all the links on a website without downloading the site itself? It only needs to find the links and save them to a text file.

I've tried this code:

wget --spider --force-html -r -l1 http://somesite.com | grep 'Saving to:'

Example: a website contains download links that point to another host (for example, dlink.com), so I just want to copy every link that contains dlink.com and save the results to a text file.

I've searched around using Google and found nothing useful.

See: http://stackoverflow.com/questions/21264626/how-to-strip-out-all-of-the-links-of-an-html-file-in-bash-or-grep-or-batch-and-s and http://stackoverflow.com/questions/1521462/looping-through-the-content-of-a-file-in-bash – jmunsch, Dec 15 '14 at 20:27
I think the two links are not related to my post, but thank you for the reply. – juicebyah, Dec 16 '14 at 11:56
You can also use PhantomJS for this: https://gist.github.com/antivanov/3848638 – Purefan, Mar 04 '15 at 08:05

score 2 · Answer 1 · edited Jan 06 '15 at 18:12

Using a proper parser in Perl:

#!/usr/bin/env perl

use strict;
use warnings;  # replaces -w, which many systems do not pass through an env shebang
use LWP::UserAgent;
use HTML::LinkExtor;
use URI::URL;

my $ua = LWP::UserAgent->new;
my ($url, $f, $p, $res);

if (@ARGV) {
    $url = $ARGV[0];
}
else {
    print "Enter a URL: ";
    $url = <STDIN>;
    chomp($url);
}

my @array = ();
sub callback {
    my ($tag, %attr) = @_;
    return if $tag ne 'a';  # we only look closer at <a href ...>
    # Keep only the href value, and only when it mentions dlink.com
    push(@array, $attr{href})
        if defined $attr{href} && $attr{href} =~ /dlink\.com/i;
}

# Make the parser.  Unfortunately, we don’t know the base yet
# (it might be different from $url)
$p = HTML::LinkExtor->new(\&callback);

# Request document and parse it as it arrives
$res = $ua->request(HTTP::Request->new(GET => $url),
                    sub {$p->parse($_[0])});

# Expand all URLs to absolute ones
my $base = $res->base;
@array = map { url($_, $base)->abs } @array;

# Print them out
print join("\n", @array), "\n";
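
Since the question asks for Bash, here is a rougher shell sketch of the same single-page filtering, for comparison. It is not a real HTML parser: it assumes double-quoted href attributes and does not resolve relative links the way the Perl script does. The function name extract_links and the output file links.txt are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read HTML on stdin and print each href value that
# contains the given keyword (literal, case-insensitive match), one per line.
# Assumption: href values are double-quoted; other forms are silently skipped.
extract_links() {
    local keyword=$1
    grep -oiE 'href="[^"]+"' \
        | sed -E 's/^[^"]*"//; s/"$//' \
        | grep -iF -- "$keyword" \
        | sort -u
}

# Usage (assumes curl is installed; the URL and file name are placeholders):
#   curl -s http://somesite.com | extract_links dlink.com > links.txt
```

A regex pass like this is fine for a quick look at one page, but the parser-based script above is the safer choice for anything with unusual markup.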

edited Jan 06 '15 at 18:12 by Peter Mortensen
answered Dec 15 '14 at 20:43 by Gilles Quénot

Hi, thanks for the reply. That Perl code works perfectly on a single link; is there any way to make it recursive and crawl all pages? – juicebyah Dec 16 '14 at 11:50
For some $ or €, I could do it, yes =) – Gilles Quénot Dec 16 '14 at 12:18
Drop me an email, dude. If it's reasonable, I might do some. – juicebyah Dec 16 '14 at 14:50
