
I need to read the content of a web page and save the HTML of part of that page.

For example, let's say I want to get only the description of an athlete on this page: https://www.olympic.org/usain-bolt (the section.text-content element).

How can I do this in Rails, so that I can store that HTML in my database and serve it later via an API?

Does anyone have a clue about this?

Sebastián Palma
mtreize

2 Answers


You can get the description easily by opening the URL, parsing the HTML, and accessing the element you pointed to, like:

require 'nokogiri'
require 'open-uri'

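url = 'https://www.olympic.org/usain-bolt'
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer opens URLs
doc = Nokogiri::HTML(URI.open(url))
# .text returns the plain text; call .to_html instead to keep the markup
puts doc.css('section.text-content').text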

Once you have the data, you need a model to store it in. You can create a new one, called Athlete for example, with the rails generate command and then run the migration:

$ rails g model Athlete description:text
$ rails db:migrate

The description attribute uses the text data type, which lets you store long strings such as the description.

Then you need to insert or update the record. You can create a new record first and update it afterwards. In the Rails console, just run:

Athlete.create

This creates a new athlete without a description, which gives you a record to look up by its id. After that you can create a rake task: add a file with the .rake extension in the lib/tasks folder and put your code in it, like:

require 'nokogiri'
require 'open-uri'

namespace :feed do
  desc 'Gets the athlete description and inserts it into the database.'
  task athlete_description: :environment do
    url = 'https://www.olympic.org/usain-bolt'
    # URI.open is needed on Ruby 3.0+, where Kernel#open no longer opens URLs
    doc = Nokogiri::HTML(URI.open(url))
    description = doc.css('section.text-content').text
    Athlete.find(1).update description: description
  end
end

The task loads the libraries, fetches the data, and updates the record through ActiveRecord. You can then run it with:

rails feed:athlete_description
# or
rake feed:athlete_description
Sebastián Palma

Nokogiri might be able to do what you need by way of CSS selectors.

If not, you can use Net::HTTP to load the page contents into a local variable and then use string manipulation to find the piece you want and store it. Unfortunately, I don't think there's any straightforward way to select that specific element with this method.
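
As a rough sketch of the Net::HTTP route, the page can be fetched as a string and the section cut out with a regular expression. The helper names fetch_page and extract_section are made up for illustration, and the regex assumes a double-quoted class attribute and no nested section tags, which may not hold for the real page:

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: returns the raw response body as a string.
# Redirects and HTTP errors are not handled in this sketch.
def fetch_page(url)
  Net::HTTP.get(URI(url))
end

# Hypothetical helper: returns the first <section> whose class attribute
# contains css_class, including its tags, or nil if none is found.
# Assumes double-quoted attributes and no nested <section> elements.
def extract_section(html, css_class)
  pattern = /<section\b[^>]*class="[^"]*\b#{Regexp.escape(css_class)}\b[^"]*"[^>]*>.*?<\/section>/m
  html[pattern]
end

# Inline sample so the example runs without a network call.
sample = '<div><section class="text-content"><p>Bio</p></section><footer>x</footer></div>'
puts extract_section(sample, 'text-content')

# With network access, the same helper could be applied to the live page:
# html = fetch_page('https://www.olympic.org/usain-bolt')
# fragment = extract_section(html, 'text-content')
```

This kind of pattern matching is brittle compared with a real parser, which is why the CSS selector approach is usually preferable when Nokogiri is available.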

Matt