-1

I'm trying to scrape a website however I cannot seem to get my while-loop to break out once it hits a page with no more information:

def scrape_verse_items(keyword)

  pg = 1

  while pg < 1000
    puts "page #{pg}"
    url = "https://www.bible.com/search/bible?page=#{pg}&q=#{keyword}&version_id=1"

    doc = Nokogiri::HTML(open(url))
    items = doc.css("ul.search-result li.reference")
    error = doc.css('div#noresults')

    until error.any? do
      if keyword != '' 
        item_hash = {}
        items.each do |item|
          title = item.css("h3").text.strip
          content = item.css("p").text.strip
          item_hash[title] = content

        end
      else
        puts "Please enter a valid search"
      end 

      if error.any? 
        break
      end
    end
    pg += 1
  end
  item_hash
end

puts scrape_verse_items('joy')
the Tin Man
  • 158,662
  • 42
  • 215
  • 303
Shamel
  • 11
  • 1
  • 4
  • 2
    Your `break` is inside the `until`, and thus will also break out of the `until` and not the `while`, is that intentional? – Jörg W Mittag Jan 11 '20 at 20:33
  • It seems like your `until error.any?` block will just go forever, because you don't redefine `error` from in there ... Why do you even need that `until` loop? – max pleaner Jan 11 '20 at 21:13
  • 1
    I suggest you use [Kernel#loop](https://ruby-doc.org/core-2.7.0/Kernel.html#method-i-loop) (together with the keyword [break](https://docs.ruby-lang.org/en/2.7.0/syntax/control_expressions_rdoc.html#label-break+Statement)) as your go-to method for loops of all kinds. – Cary Swoveland Jan 11 '20 at 21:19
  • 2
    https://stackoverflow.com/a/1352143/299774 – Greg Jan 11 '20 at 22:50
  • We need the minimal HTML in the question itself, not on the website, that demonstrates the problem, along with your expected results. Without that information answers can vary and not solve the problem. – the Tin Man Jan 12 '20 at 05:03
  • Thanks for all your feedback you guys are awesome and have given me lots to think about. – Shamel Jan 13 '20 at 13:10

1 Answers1

0

I know this doesn't exactly answer your question, but perhaps you might consider using a different approach altogether. Using while and until loops can get a bit confusing, and usually isn't the most performant way of doing things.

Maybe you would consider using recursion instead. I've written a small script that seems to work :

class MyScrapper

  def initialize;end

  def call(keyword)
    puts "Please enter a valid search" && return unless keyword

    scrape({}, keyword, 1)
  end

  private

  def scrape(results, keyword, page)
    doc = load_page(keyword, page)

    return results if doc.css('div#noresults').any?

    build_new_items(doc).merge(scrape(results, keyword, page+1))
  end

  def load_page(keyword, page)
    url = "https://www.bible.com/search/bible?page=#{page}&q=#{keyword}&version_id=1"
    Nokogiri::HTML(open(url))
  end

  def build_new_items(doc)
    items = doc.css("ul.search-result li.reference")

    items.reduce({}) do |list, item|
      title = item.css("h3").text.strip
      content = item.css("p").text.strip

      list[title] = content
      list
    end
  end
end

You call it by doing MyScrapper.new.call("Keyword") (It might make more sense to have this as a module you include or even have them as class methods to avoid the need to instantiate the class.

What this does is, call a method called scrape and you give it the starting results, keyword, and page. It loads the page, if there are no results it returns the existing results it has found. Otherwise it builds a hash from the page it loaded, and then the method calls itself, and merges the results with the new hash it just build. It does this till there are no more results.

If you want to limit the page results you can just change this like:

return results if doc.css('div#noresults').any?

to this:

return results if doc.css('div#noresults').any? || page > 999

Note: You might want to double-check the results that are being returned are correct. I think they should be but I wrote this quite quickly, so there could always be a small bug hiding somewhere in there.

the Tin Man
  • 158,662
  • 42
  • 215
  • 303
8bithero
  • 1,474
  • 1
  • 18
  • 23
  • "You might want to double-check the results that are being returned are correct." No, that's not how we answer questions. Make sure your results match the required results given by the OP. If they don't then the answer isn't acceptable. If the OP didn't give desired results then don't answer, instead vote on the answer and request that the OP provide the necessary information before trying to answer; Open-ended questions are off-topic as too broad, needing better definition or not debuggable. Remember, Stack Overflow isn't about the immediate answer, it's about long-term solutions. – the Tin Man Jan 12 '20 at 05:02
  • Thanks for your feedback, I really appreciate it! – Shamel Jan 13 '20 at 13:13