How to parse data and store it into variables using Nokogiri and Ruby

Question

When I assign variable names such as service_names and name_array they are nil and nothing goes to the class variable @@product_names.

I used Pry to try the code without storing it into a variable and it works. It has the values I need.

I had this split up in more variables before to make cleaner code, for example:

require 'pry'
require 'rubygems'
require 'open-uri'
require 'nokogiri'


  class KefotoScraper::CLI
      @@product_names =[]
      PAGE_URL = "https://kefotos.mx/"


    def call
binding.pry
    puts "These are the services that Kefoto offers:"
    #list_products
    puts "which service would you like to select?"
    @selection = gets.chomp
    view_price_range
    puts "Would you like to go back to the service menu? y/n"
    answer = gets.chomp
      if answer == "y"
      call
      end
    end

    private

    def home_html
        # @home_html ||=
        #     HTTParty.get root_path

        Nokogiri::HTML(open(PAGE_URL))


    end
    #
    # # TODO: read about ruby memoization
    # def home_node
    #
    #     @home_node ||=
    #      Nokogiri::HTML(PAGE_URL)
    # end

def service_names
      @service_names = home_html.css(".nav-link").map do
        |link| link['href'].to_s.gsub(/.php/, "")
      end
      @service_names.each do |pr|
        @@product_names << pr
      end
    end

     def list_products
       i = 1
       n = 0

       while @@product_names.length < n
        @@product_names.each do |list_item|

         puts "#{i} #{list_item[n]}"
       i += 1
       n += 1
       end
      end
   end

     def view_price_range
       price_range = []
       @service_links.each do |link|
         if @service = link
           link.css(".row").map {|price| price["p"].value}
            price_range << p
         end
         price_range
     end


    def service_links
        @service_links ||=
        home_html.css(".nav-item").map { |link| link['href'] }
    end

end
end

@@product_names should contain the code that comes out of

home_html.css(".nav-link").map { |link| link['href'] }.to_s.gsub(/.php/, "")

which later I turn back to an array.

This is what it looks like in Pry:

9] pry(#<KefotoScraper::CLI>)> home_html.css(".nav-link").map { |link| link['href'] }.to_s.gsub(/.php/, "").split(",")
=> ["[\"foto-enmarcada\"", " \"impresion-fotografica\"", " \"photobooks\"", " \"impresion-directa-canvas\"", " \"impresion-acrilico\"", " \"fotoregalos\"]"]
[10] pry(#<KefotoScraper::CLI>)> home_html.css(".nav-link").map { |link| link['href'] }.to_s.gsub(/.php/, "").split(",")[0]
=> "[\"foto-enmarcada\""

Please do not post images of plaintext. [Images of plaintext are not appropriate on StackOverflow](https://meta.stackoverflow.com/a/285557/3784008). You have access to the plaintext; please copy and paste it into your question. — anothermh, Oct 16 '19 at 19:21
Welcome to SO! Please take the time to use correct grammar and format your posting for readability. See "[How do I format my code blocks?](https://meta.stackexchange.com/questions/22186/how-do-i-format-my-code-blocks)" and "[ask]" and the linked pages and "[mcve](https://stackoverflow.com/help/minimal-reproducible-example)". It helps everyone. — the Tin Man, Oct 18 '19 at 05:29
I'd recommend reducing the example to its bare minimum that still demonstrates the problem you're having. Show us an example of what you expect as a result of the code. — the Tin Man, Oct 18 '19 at 05:37

the Tin Man · Answer 1 · 2019-10-20T23:40:07.290

Nokogiri's command-line IRB is your friend. Use nokogiri "https://kefotos.mx/" at the shell to start it up:

irb(main):006:0> @doc.css('.nav-link[href]').map { |l| l['href'].sub(/\.php$/, '') }
=> ["foto-enmarcada", "impresion-fotografica", "photobooks", "impresion-directa-canvas", "impresion-acrilico", "fotoregalos"]

That tells us it's not dynamic HTML and shows how I'd retrieve those values. Since an a tag doesn't have to contain href parameters I guarded against retrieving any such tags by accident.

You've got bugs, potential bugs and bad practices. Here are some untested but likely to work ways to fix them:

Running the code results in:

uninitialized constant KefotoScraper (NameError)

In your code you have @service and @service_links which are never initialized so...?

Don't do this because it's cruel:

def home_html
  Nokogiri::HTML(open(PAGE_URL))
end

Every time you call home_html you (re)open and (re)read the page from the remote site and wasting your and their CPU and network time. Instead, cache the parsed document in a variable kind of like you did in your commented-out line using HTTParty. It's much more friendly to not hit sites repeatedly and helps avoid getting banned.

Moving on:

def service_names
  @service_names = home_html.css(".nav-link").map do
    |link| link['href'].to_s.gsub(/.php/, "")
  end
  @service_names.each do |pr|
    @@product_names << pr
  end
end

I'd use something like get_product_names and return the array like I did in Nokogiri above:

def get_product_names
  get_html.css('.nav-link[href]').map { |l|
    l['href'].sub(/\.php$/, '')
  }
end
:
:
@@product_names = get_product_names()

Here's why I'd do it another way. You used:

link['href'].to_s.gsub(/.php/, "")

to_s is redundant because link['href'] is already returning a string. Stringizing a string wastes brain cycles when rereading/debugging the code. Be kind to yourself and don't do that.
```
require 'nokogiri'
html = '<a href="foo">'
doc = Nokogiri::HTML(html)
doc.at('a')['href'] # => "foo"
doc.at('a')['href'].class # => String
```
gsub Ew. How many occurrences of the target string do you anticipate to find and replace? If only one, which is extremely likely in a URL "href", instead use sub because it's more efficient; It only runs once and moves on whereas gsub looks through the string at least one additional time to see if it needs to run again.
/.php/ doesn't mean what you think it does, and it's a very subtle bug in waiting. /.php/ means "some character followed by "php", but you most likely meant "a period followed by 'php'". This was something I used to see all the time because other programmers I worked with didn't bother to figure out what they were doing, and being the senior guy it was my job to pick their code apart and find bugs. Instead you should use /\.php/ which removes the special meaning of ., resulting in your desired pattern which is not going to trigger if it encounters "aphp" or something similar. See "Metacharacters and Escapes" and the following section on that page for more information.
On top of the above, the pattern needs to be anchored to avoid wasting more CPU. /\.php/ will cause the regular expression engine to start at the beginning of the string and walk through it until it reaches the end. As strings get longer that process gets slower, and in production code that is processing GB of data it can slow down a system markedly. Instead, using an anchor like /\.php$/ or /\.php\z/ gives the engine a hint where it should start looking and can result in big speedups. I've got some answers on SO that go into this, and the included benchmarks show how they help. See "Anchors" for more information.

That should help you but I didn't try modifying your code to see if it did. When asking questions about bugs in your code we need the minimum code necessary to reproduce the problem. That lets us help you more quickly and efficiently. Please see "ask" and the linked pages and "mcve".

hi there. thanks for your reply. I am incorporating 'httpart' into it instead of Nokogiri::HTML and also the Regex you used. However, When I use HTTParty on it i get error ```bad argument (expected URI object or URI string) (ArgumentError)``` I have used it like this: ```HTTParty.get(@page_url)``` where @page_url = "https://kefotos.mx/" in an initialize method. Here is my repo: https://github.com/acantu101/scraping_final/blob/master/kefoto/lib/kefoto_scraper/CLI.rb — Alejandra Cantu, Oct 21 '19 at 21:32

How to parse data and store it into variables using Nokogiri and Ruby

1 Answers1