
So there is this website that shows the most popular websites. I am trying to write a script that takes two arguments: the first one is an HTML file, and the second one a text file. All the website URLs from the first argument should go into the second, so at the end the text file should contain entries like:

http://www.website1.com/
http://www.website2.com/
...

If I say

cat argument1.html

stuff like this is printed:

<a href="http://babelfish.altavista.com/babelfish/trurl_pagecontent?lp=en_nl&url=http%3A%2F%2Fwww.100bestwebsites.org%2F"><img src="Holland.gif" height="33" width="50"><br>DUTCH</a></font></div></td>
     <td width="10%"> 
     <div align="center"><font face="Arial, Helvetica, sans-serif" size="2"><a href="http://babelfish.altavista.com/babelfish/trurl_pagecontent?lp=en_el&url=http%3A%2F%2Fwww.100bestwebsites.org%2F"><img src="Greece.gif" height="33" width="50"><br>GREEK</a></font></div></td>

so you can see that there is a lot of surrounding markup, but buried somewhere in the middle are the actual website URLs. I need to use grep and sed.

Any help is appreciated. I know the basics of grep and sed, but it looks like for this the basics are not enough.

Haz
  • To do this with sed is SUCH a pain in the ass, you're better off using python/perl/ruby... anything else. Especially since you can possibly have multiple – Javier Buzzi Oct 21 '15 at 02:11
  • I think [this is an appropriate reference](http://stackoverflow.com/a/1732454/1270789) for what you are trying to do. You would be better, I think, using something like `ruby` with `nokogiri` or `perl` with a suitable HTML DOM parser than mucking around with `grep` and `sed`. – Ken Y-N Oct 21 '15 at 02:11
  • Hahaha @KenY-N -- what can i say, great minds think alike ;) – Javier Buzzi Oct 21 '15 at 02:12
  • My assignment says I cannot use any of those things you guys just mentioned. I guess the purpose of the assignment is to gain practice with sed and grep – Haz Oct 21 '15 at 02:12
  • Agreed on using a DOM parser for this. If the HTML is well formatted (don't count on it), you might use an XML parser. The popular web languages usually have tools to easily parse HTML. JavaScript does it most naturally if you are open to using something like Node.js. –  Oct 21 '15 at 02:17
  • To check your spec, are you wanting to extract a complete URL like `http://babelfish.altavista.com/babelfish/trurl_pagecontent?lp=en_nl&url=http%3A%2F%2Fwww.100bestwebsites.org%2F`, or just the `url` portion of the URL, eg `http%3A%2F%2Fwww.100bestwebsites.org%2F` then convert that to `http://www.100bestwebsites.org/`? – Ken Y-N Oct 21 '15 at 02:22
  • obligatory link: http://stackoverflow.com/a/1732454/7552 – glenn jackman Oct 21 '15 at 02:26
  • @KenY-N well yeah. I mean the html that is passed as a parameter has many websites so I would want all the website urls. Isn't there a quick grep-sed approach in which I find the lines that contain http, and then I replace everything before it with empty string? Same approach for replacing what is after the URL – Haz Oct 21 '15 at 02:26
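To make Ken Y-N's decoding question concrete, here is a sketch (my own illustration, not from the thread) of undoing the percent-encoding in the `url=` parameter with sed. It only handles `%3A` and `%2F`, the two escapes that appear in the sample; a real decoder would need to handle every `%XX` escape:

```shell
# Decode the percent-encoded url= value from the Babelfish links.
# Only %3A (:) and %2F (/) are translated here.
echo 'http%3A%2F%2Fwww.100bestwebsites.org%2F' \
  | sed -e 's|%3A|:|g' -e 's|%2F|/|g'
# prints http://www.100bestwebsites.org/
```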

1 Answer


Here you go then:

cat argument1.html | grep -o '<a href=['"'"'"][^"'"'"']*['"'"'"]' | sed -e 's/^<a href=["'"'"']//' -e 's/["'"'"']$//'

or

cat argument1.html | grep -o '<a .*href=.*>' | sed -e 's/<a/\n<a/g' | sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'

Credit: Easiest way to extract the urls from an html page using sed or awk only
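The quoting in those pipelines is dense: each `'"'"'` closes the single-quoted argument, emits a literal quote character, and reopens it, so the bracket expressions end up matching both `"` and `'` around the href. A stripped-down variant (my own sketch, which assumes every href is double-quoted, unlike the real answer) looks like:

```shell
# Simplified version of the answer's first pipeline, assuming hrefs are
# always double-quoted. grep -o prints only the matched part, one per line,
# so multiple links on one line are still handled.
grep -o '<a href="[^"]*"' argument1.html \
  | sed -e 's/^<a href="//' -e 's/"$//' > argument2.txt
```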

Javier Buzzi
  • Could you explain to me? – Haz Oct 21 '15 at 02:37
  • Explain the regex? I didn't write this, and when I do write regex, after 2 hours I forget what the heck it does! I know it finds the `<a href` part (first grep) and then I get lost #truth. I normally do this with `python`, takes me < 5 mins AND I don't have to use regex – Javier Buzzi Oct 21 '15 at 02:50
  • 2
    @Haz. if you're doing an assignment for school and you now have working code, you should take it on yourself to dissect the code so you understand it. Take just the first 2 parts of the pipe line and cut/paste onto command line. Look at that output until you understand (after consulting your sed doc from your class) what is going on, then add another part of the pipeline and observe how the changes in output match the code that has been added. Repeat until you can pass the final exam for your class! Good luck. – shellter Oct 21 '15 at 02:54
  • I won't because that is an overkill approach – Haz Oct 21 '15 at 03:02
  • @Haz Start dissecting it; looking at those quotes I'm sure you can get them down. But you're going to have to justify this, so you'd better learn it. I'm not 100% sure it's overkill; looking at it, it handles a lot of edge cases, for example: `` it will extract the `href` properly.. not everything is cookie cutter man.. – Javier Buzzi Oct 21 '15 at 03:05
  • Thanks for your concern but I already found a better approach. Better than your overkill suggestion. Have an excellent night – Haz Oct 21 '15 at 03:22
  • 1
    @Haz you asked a question and received an answer. No reason to call the ONLY answer you received as overkill. You don't need to accept it as your answer if you don't feel it is the appropriate answer to your question. And if you came to your own conclusion it is common courtesy to answer your own question and accept it as the answer so that other people searching the interwebs can take advantage from it. – ptierno Oct 21 '15 at 04:06
  • @JavierBuzzi I don't see many ways to parse an html fully without `regex`, and that is including using `python` – ptierno Oct 21 '15 at 04:09
  • @ptierno in python you use tools that have been written for this exact reason; under the hood they could use a parser/lexer setup or regex - my point was that you don't have to worry about it if you use those tools. It's all abstracted away from you so you can focus on what you want: the `href` values! – Javier Buzzi Oct 21 '15 at 13:05