
I have a file that contains a Google search results page. I used

w3m -no-cookie $search > google

to create the page.

After that, I need to extract all the sites contained in that page: basically, all the strings that start with "www" and end with "/".

I tried :

grep -Fw "www" google | awk -F "/" '{ print $1";" }'

but it also gives me everything on the line that comes before "www".

How do I remove that part?

Should I use sed?

Thanks!

  • Note that `w3m` does not give you the full url, and the string `www` will not necessarily find all urls. You also don't know in what way google's search output may change over time. – Henk Langeveld Aug 04 '12 at 17:29
    http://stackoverflow.com/questions/1881237/easiest-way-to-extract-the-urls-from-an-html-page-using-sed-or-awk-only suggests the use of `lynx -dump -listonly`. Works for me (a sketch of that approach follows these comments). – Henk Langeveld Aug 04 '12 at 17:33
    This question isn't really about string manipulation in bash, it's more about string manipulation using gnu coreutils – richo Aug 04 '12 at 18:03
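
A rough sketch of the `lynx` approach mentioned in the comments, using a placeholder query URL; the awk pattern relies on the numbered reference list that `lynx -dump -listonly` prints:

url="http://www.google.com/search?q=foo"   # placeholder search URL
lynx -dump -listonly "$url" | awk '/^ *[0-9]+\./ { print $2 }'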

1 Answer


Assuming that all sites start with www is a bit weird, but here it is:

Your problem is that `grep` will return the whole line. With `-o` it will only return the matched part:

grep -wo "www.*" google | awk -F "/" '{ print $1";" }'

or simply:

grep -wo "www[^/]*" google
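
If you still want the trailing `;` that your awk command added, a sed step can append it to each line:

grep -wo "www[^/]*" google | sed 's/$/;/'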
Karoly Horvath