
I have a file that contains a Google search results page. I used

w3m -no-cookie $search > google

to create the page.

After that, I need to extract all the sites contained in that page: basically, all the strings that start with "www" and end with "/".

I tried :

grep -Fw "www" google | awk -F "/" '{ print $1";" }'

but it also gives me everything on the line that comes before "www".

How do I remove that part?

Should I use sed?

Thanks!

  • Note that `w3m` does not give you the full url, and the string `www` will not necessarily find all urls. You also don't know in what way google's search output may change over time. – Henk Langeveld Aug 04 '12 at 17:29
    http://stackoverflow.com/questions/1881237/easiest-way-to-extract-the-urls-from-an-html-page-using-sed-or-awk-only suggests the use of `lynx -dump -listonly`. Works for me (a sketch of that approach follows these comments). – Henk Langeveld Aug 04 '12 at 17:33
    This question isn't really about string manipulation in bash, it's more about string manipulation using gnu coreutils – richo Aug 04 '12 at 18:03
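
A rough sketch of the `lynx` approach mentioned in the comments, using a placeholder query URL; the awk pattern relies on the numbered reference list that `lynx -dump -listonly` prints:

url="http://www.google.com/search?q=foo"   # placeholder search URL
lynx -dump -listonly "$url" | awk '/^ *[0-9]+\./ { print $2 }'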

1 Answer


Assuming that all sites start with www is a bit weird, but here it is:

Your problem is that `grep` will return the whole line. With `-o` it will only return the matched part:

grep -wo "www.*" google | awk -F "/" '{ print $1";" }'

or simply:

grep -wo "www[^/]*" google
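
If you still want the trailing `;` that your awk command added, a sed step can append it to each line:

grep -wo "www[^/]*" google | sed 's/$/;/'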
Karoly Horvath