
I need some assistance with my bash shell, with something that seems to me like it should be pretty simple. I want to extract all of the links from a given website and print them to standard output, and I want to do this all through a script of my own. My goal is to have a command where the website I'll be extracting the links from is passed as an argument. Here's what I have so far:

cat > extract_links

curl $1 | grep

I don't really have much programming experience, so sorry if this isn't much of a start. Is it necessary to use regular expressions? If anyone is willing to help, code that is as simple as possible would be much appreciated. Thanks!

theamateurdataanalyst
  • Matching links with grep is difficult, since the anchor tags can span multiple lines. If you're not much of a programmer, I suggest you use existing tools to do this instead of trying to script your own. – Barmar Sep 21 '13 at 00:56
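Following up on the "existing tools" suggestion, one commonly used option is the lynx text browser; a minimal sketch, assuming lynx is installed and http://example.com stands in for the real site:

lynx -dump -listonly http://example.com

Here -dump prints the rendered page to standard output and -listonly restricts that output to the list of links, so no hand-rolled regex is needed.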

2 Answers


This is a one-liner:

curl -s "$1" | grep -Eo 'https?://[A-Za-z0-9._/&=?%-]+'

You need -E so that ? and + act as regex quantifiers and -o so grep prints only the matching URLs rather than whole lines; curl fetches the page, since $1 is the site address.

This will find all URLs. If by "link" you really mean "anchor tags", that is a bit trickier, but doable. You haven't given any sample input or output, so I can't tell exactly what you want.

You can get fancier with the regex; it depends on how the URLs are embedded in your documents.
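If you do mean the href of each anchor tag, here is a rough sketch of one way to pull those out, assuming the attributes are written as href="..." with double quotes and that each href value sits on a single line (true for most, but not all, HTML):

curl -s "$1" | grep -o 'href="[^"]*"' | cut -d'"' -f2

This prints one href value per line. Relative links come out as-is, so you would still have to prepend the site address yourself if you need absolute URLs.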

Bohemian

This is a lot easier in Python.

Just use x = string.find('href="'), slice off everything up to the value with string = string[x + 6:], then find the closing quote with y = string.find('"') and print the URL with print string[:y]. Put that in a while loop and you should be good to go.
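A rough sketch of that loop, assuming page_html already holds the downloaded HTML as a string (fetching the page is left out, and this naive scan only handles double-quoted href attributes):

# Naive href extractor: repeatedly find href="..." and print what is inside the quotes.
page_html = '<a href="http://example.com">example</a> <a href="/about">about</a>'

rest = page_html
while True:
    x = rest.find('href="')
    if x == -1:                        # no more href="..." occurrences
        break
    rest = rest[x + len('href="'):]    # cut everything up to the start of the URL
    y = rest.find('"')                 # position of the closing quote
    if y == -1:
        break
    print(rest[:y])                    # the URL itself
    rest = rest[y + 1:]                # keep scanning after this link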

cpu2