
I need some assistance with my bash shell, with something that seems to me like it should be pretty simple. I want to extract all of the links from a given website and print them to standard output, and I want to do this all through a script of my own. My goal is to have a command where the website I'll be extracting the links from is passed as an argument. Here's what I have so far:

cat > extract_links

curl $1 | grep

I don't really have much programming experience, so sorry if this isn't much of a start. Is it necessary to use regular expressions? If anyone is willing to help, code that is as simple as possible would be much appreciated. Thanks!

theamateurdataanalyst
  • Matching links with grep is difficult, since the anchor tags can span multiple lines. If you're not much of a programmer, I suggest you use existing tools to do this instead of trying to script your own. – Barmar Sep 21 '13 at 00:56
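Following up on the "existing tools" suggestion, one commonly used option is the lynx text browser; a minimal sketch, assuming lynx is installed and http://example.com stands in for the real site:

lynx -dump -listonly http://example.com

Here -dump prints the rendered page to standard output and -listonly restricts that output to the list of links, so no hand-rolled regex is needed.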

2 Answers


This is a one-liner:

curl -s "$1" | grep -Eo 'https?://[A-Za-z0-9._/&=?%-]+'

You need -E so that ? and + act as regex quantifiers and -o so grep prints only the matching URLs rather than whole lines; curl fetches the page, since $1 is the site address.

This will find all URLs. If by "link" you really mean "anchor tags", that is a bit trickier, but doable. You haven't given any sample input or output, so I can't tell exactly what you want.

You can get fancier with the regex; it depends on how the URLs are embedded in your documents.
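If you do mean the href of each anchor tag, here is a rough sketch of one way to pull those out, assuming the attributes are written as href="..." with double quotes and that each href value sits on a single line (true for most, but not all, HTML):

curl -s "$1" | grep -o 'href="[^"]*"' | cut -d'"' -f2

This prints one href value per line. Relative links come out as-is, so you would still have to prepend the site address yourself if you need absolute URLs.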

Bohemian

This is a lot easier in Python.

Just use x = string.find('href="'), slice off everything up to the value with string = string[x + 6:], then find the closing quote with y = string.find('"') and print the URL with print string[:y]. Put that in a while loop and you should be good to go.
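A rough sketch of that loop, assuming page_html already holds the downloaded HTML as a string (fetching the page is left out, and this naive scan only handles double-quoted href attributes):

# Naive href extractor: repeatedly find href="..." and print what is inside the quotes.
page_html = '<a href="http://example.com">example</a> <a href="/about">about</a>'

rest = page_html
while True:
    x = rest.find('href="')
    if x == -1:                        # no more href="..." occurrences
        break
    rest = rest[x + len('href="'):]    # cut everything up to the start of the URL
    y = rest.find('"')                 # position of the closing quote
    if y == -1:
        break
    print(rest[:y])                    # the URL itself
    rest = rest[y + 1:]                # keep scanning after this link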

cpu2