I have a program that searches Google using one or more keywords passed as a parameter when the program is run.

For example, `pull_sites.rb "testing"` returns these sites:

https://en.wikipedia.org/wiki/Software_testing
http://en.wikipedia.org/wiki/Test_automation
http://www.istqb.org/about-istqb.html
http://softwaretestingfundamentals.com/test-plan/
https://en.wikipedia.org/wiki/Software_testing
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:9qU2GDLzZzEJ:https://en.wikipedia.org/wiki/Software_testing%252Btesting%26gbv%3D1%26%26ct%3Dclnk
https://en.wikipedia.org/wiki/Test_strategy
https://en.wikipedia.org/wiki/Category:Software_testing
https://en.wikipedia.org/wiki/Test_automation
https://en.wikipedia.org/wiki/Portal:Software_testing
https://en.wikipedia.org/wiki/Test
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:R94CAo00wOYJ:https://en.wikipedia.org/wiki/Test%252Btesting%26gbv%3D1%26%26ct%3Dclnk
https://en.wikipedia.org/wiki/Unit_testing
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:G9V8uRLkPjIJ:https://en.wikipedia.org/wiki/Unit_testing%252Btesting%26gbv%3D1%26%26ct%3Dclnk
https://testing.byu.edu/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:d9bGrCHr9fsJ:https://testing.byu.edu/%252Btesting%26gbv%3D1%26%26ct%3Dclnk
https://www.test.com/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:S92tylTr1V8J:https://www.test.com/%252Btesting%26gbv%3D1%26%26ct%3Dclnk
http://ddce.utexas.edu/disability/using-testing-accommodations/
http://blogs.vmware.com/virtualblocks/2015/07/06/vsan-vs-nutanix-head-to-head-performance-testing-part-4-exchange/
http://www.networkforgood.com/nonprofitblog/testing-101-4-steps-optimizing-your-fundraising-approach/
http://www.auslea.com/software-testing-training.html
http://academy.littletonpublicschools.net/Default.aspx%3Ftabid%3D12807%26articleType%3DArticleView%26articleId%3D2400
https://golang.org/pkg/testing/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:EALG7Jlm9eoJ:https://golang.org/pkg/testing/%252Btesting%26gbv%3D1%26%26ct%3Dclnk
http://www.speedtest.net/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:M47_v0xF3m8J:http://www.speedtest.net/%252Btesting%26gbv%3D1%26%26ct%3Dclnk
https://www.act.org/content/act/en/products-and-services/the-act/taking-the-test.html
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:1sMSoJBXydoJ:https://www.act.org/content/act/en/products-and-services/the-act/taking-the-test.html%252Btesting%26gbv%3D1%26%26ct%3Dclnk
http://www.act.org/content/act/en/products-and-services/the-act/test-preparation.html
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:pAzlNJl3YY4J:http://www.act.org/content/act/en/products-and-services/the-act/test-preparation.html%252Btesting%26gbv%3D1%26%26ct%3Dclnk

It works as expected but only scrapes the first page of Google results. Is it possible to search, say, pages 1-5?

Here's the source of the scrape:

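  require 'mechanize'

  # Scrapes result URLs from the first Google results page into links.txt.
  def get_urls
    puts "Searching...".green # String#green is assumed to come from a gem such as colorize
    agent = Mechanize.new
    page = agent.get('http://www.google.com/')
    google_form = page.form('f')
    google_form.q = SEARCH # SEARCH is the parameter given when the program is run
    page = agent.submit(google_form, google_form.buttons.first)
    page.links.each do |link|
      href = link.href.to_s
      next unless href =~ /url\?q/ # result links look like /url?q=<target>&...
      # The target URL sits between the first "=" and the next "&"
      url = href.split(/=|&/)[1]
      File.open("links.txt", "a+") { |f| f.puts(url) }
    end
  end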
FrankS101
13aal
  • Yes it's possible. Have you tried to click or navigate to other pages? – kjprice Mar 08 '16 at 18:51
  • @kjprice How can you click and navigate to another page within a program when it's running already? The question asks if it's possible to search the pages within the program not if I can click a 2, 3, or 4.. – 13aal Mar 08 '16 at 19:01
  • @13aal Yes you can tell mechanize to click on the page links at the bottom after it has scraped the first page and then scrape those pages etc. Is that what you are asking how to do? – bkunzi01 Mar 08 '16 at 19:15
  • @13aal Here's the docs for `Mechanize` http://docs.seattlerb.org/mechanize/GUIDE_rdoc.html – kjprice Mar 08 '16 at 19:19
  • @bkunzi01 Yes that's exactly what I'm trying to do I apologize for the confusion, the docs don't seem to answer my question..? – 13aal Mar 08 '16 at 22:06
  • Take a look here as well: http://stackoverflow.com/questions/22657548/is-it-ok-to-scrape-data-from-google-results/22703153#22703153 While the pagination issue is likely solved you probably ran into captcha problems directly after. – John Apr 01 '17 at 15:46

2 Answers

If you are using Google Chrome or Firefox, open the developer tools; they will help you identify the links you want to click automatically. When you do a Google search and scroll to the bottom, you will see the page links. Use the developer tools to find out what class, id, or text Google assigns to these page-number links, then use Mechanize's `click` method to follow them. For example, if the link is labelled "next", you can use something simple like:

page2 = page1.link_with(:text => "next").click

I'm answering from my phone, so it may save you time to google "click a link" with Mechanize for more details.
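
To go through several pages rather than just one, the same idea can be put in a loop. This is only a rough sketch: `each_result_page` is a made-up helper name, and the link text "Next" is an assumption that has to be checked against the real page in the developer tools. It relies only on `link_with` (which returns `nil` when nothing matches) and `click`.

```ruby
# Hypothetical helper: yields the first page, then keeps following the
# "Next" link until max_pages pages have been yielded or no link is found.
# next_text is an assumed link label; verify it in the developer tools.
def each_result_page(first_page, max_pages: 5, next_text: "Next")
  page = first_page
  max_pages.times do
    yield page
    next_link = page.link_with(text: next_text)
    break if next_link.nil? # last page reached, or the label is wrong
    page = next_link.click  # click returns the page the link leads to
  end
end
```

In the question's code, the page returned by `agent.submit` would be passed in as `first_page`, and the `page.links.each` loop would go inside the block.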

bkunzi01
  • So for example: `page_1 = "http://google.com"`;`page_2 = page_1.link_with(:text => "search").click`;`page_3 = page_2.link_with(:text => "search").click` will click through page 1, 2, and 3? – 13aal Mar 09 '16 at 18:10
  • You have the concept down; however, I don't think the link with the text "search" is what you want, based on your example. I would think you want the link with the text "next", since that takes you from page 1 to page 2. But if you are sure the link has the text "search", then yes, you're good to go. – bkunzi01 Mar 09 '16 at 18:51

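That's a GET form, so it's much easier to just make the request yourself. The `start` parameter is the offset of the first result, so at 10 results per page the first three pages are: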

https://www.google.com/search?q=foo
https://www.google.com/search?q=foo&start=10
https://www.google.com/search?q=foo&start=20
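
Building those URLs for pages 1-5 could look something like this. It is a sketch, not a definitive implementation: `google_page_url` is a made-up helper name, and 10 results per page is an assumption based on the URLs above.

```ruby
require "uri"

# Hypothetical helper: builds the search URL for a 1-based page number,
# assuming per_page results on each page (10 in the URLs above).
def google_page_url(query, page, per_page: 10)
  params = { q: query }
  offset = (page - 1) * per_page
  params[:start] = offset if offset > 0 # page 1 needs no start parameter
  "https://www.google.com/search?" + URI.encode_www_form(params)
end

# URLs for pages 1 through 5 of a search for "foo"
urls = (1..5).map { |n| google_page_url("foo", n) }
puts urls
```

Each URL can then be fetched with `agent.get(url)` and run through the same link-extraction loop as in the question.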
pguardiario