
I am scraping a database of companies available in this format, where each company is on a different page, identified by the number at the end of the URL (the example above is 15310; see URL). I am using rvest.

I want to extract all entries shown under "Organización". Each variable name is in bold, followed by the value in normal text. In the example above there are 16 variables to extract.

There are two problems with these websites:

  1. the value text is not wrapped in its own element (whereas the variable names are). In effect, the code looks something like this (notice "Value" sits outside the label):
     <div class="form-group">
         <label for="variable_code">variable_name</label>
         Value
     </div>
  2. not all companies have the same number of variables (16 in the example above). Some have more, others fewer. Still, variable_code and variable_name are consistent throughout the database.
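For what it's worth, the stray text node in the first problem is still addressable with XPath, which (unlike CSS selectors) can select bare text nodes. A minimal sketch against a toy version of the markup above — the label code RazonSocial is taken from the question, while the value text is invented:

```r
library(xml2)
library(rvest)

# Toy reproduction of the structure described above
page <- read_html('
  <div class="form-group">
    <label for="RazonSocial">Razón Social</label>
    Some Company Ltd.
  </div>')

# XPath can step from the label to the bare text node that follows it
page %>%
  html_node(xpath = "//label[@for='RazonSocial']/following-sibling::text()") %>%
  html_text(trim = TRUE)
```

The key point is the `following-sibling::text()` axis step, which has no CSS equivalent.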

I can think of two options for scraping the data. One is to scrape based on fixed positions, using "nth-child"-style CSS selectors to get each variable. However, because the number of variables changes across companies, I need to save both the variable name and the value as R variables. This is shown in the code below (for one website; for more I just need to add a loop, which is irrelevant here):

library(xml2)
library(rvest)
library(stringr)

url <- "https://tramites.economia.gob.cl/Organizacion/Details/15310"

webpage <- read_html(url) # read the webpage

# Select by position within the division; this returns both the variable
# name and its value in a single node. Ideally you would want only the
# value, to assign to a column in a data frame.
title_html <- html_nodes(webpage, "body > div > div:nth-child(5) > div > div:nth-child(1) > div:nth-child(3)")

title <- html_text(title_html)

# The label and the value are separated by line breaks
parts <- strsplit(title, "\r\n")[[1]]
variable_name <- trimws(parts[2])
value <- trimws(parts[3])

So, the above works, but it is time-consuming, since it saves the variable names as R variables, after which I still need to reshape the data.

Another option is to scrape based on labels. That is, to search for each variable in the code and get its value. Something like:

title_html <- html_nodes(webpage, "body > div > div:nth-child(5) > div > div:nth-child(1) label[for=RazonSocial]") 

The problem with this approach is that the value of each variable is free text (i.e. outside any specific element). Thus it cannot be obtained through CSS selectors alone, as explained in many places (e.g. here, here, or here). Evidently, I cannot change the HTML code.

What can I do to improve the scraping process? Am I stuck with the brute-force first method, extracting everything as variables? Or can I somehow gain efficiency?

PS: one way I was thinking of is to somehow get the position where the label is found using the second method, and then get the value using the first. But I doubt R has this option (like ADDRESS or CELL in Excel).
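If it helps, the "find the label's position, then fetch the value" idea in the PS is essentially what XPath's following-sibling axis does: locate the label by its `for` attribute, then step to the text node after it. A hedged sketch — only RazonSocial is confirmed by the question; the other variable codes would have to be read off the page source:

```r
library(rvest)
library(purrr)

webpage <- read_html("https://tramites.economia.gob.cl/Organizacion/Details/15310")

# Hypothetical vector of variable codes; extend it with the other codes
# found in the page source
codes <- c("RazonSocial")

# For each code, locate the label and take the first non-blank text node
# that follows it
values <- map_chr(codes, function(code) {
  webpage %>%
    html_node(xpath = sprintf(
      "//label[@for='%s']/following-sibling::text()[normalize-space()][1]",
      code)) %>%
    html_text(trim = TRUE)
})
names(values) <- codes
```

A missing variable simply yields `NA` for that code, so companies with fewer fields are handled gracefully.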

luchonacho

1 Answer


I would do something like this:

library(xml2)
library(rvest)
library(tidyverse)

url <- "https://tramites.economia.gob.cl/Organizacion/Details/15310"

webpage <- read_html(url)

# select every 'cell'
form_groups <- webpage %>% 
  html_nodes("div.form-group")

# select cell names (label or strong)
form_groups_labels <- form_groups %>% 
  html_node("label, strong") %>% 
  html_text()

# select cell contents (text following <br>, or inside p)
form_groups_values <- form_groups %>% 
  html_node(xpath = "br/following-sibling::text()|p") %>% 
  html_text(trim = TRUE)

# store it in a table
df <- tibble(
  id = 15310,
  labels = form_groups_labels,
  values = form_groups_values
)

# make it wider if required (after you've gathered all organizations)
df %>% 
  pivot_wider(id_cols = id,
              names_from = labels,
              values_from = values)
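
To extend this to the whole database, the same extraction can be wrapped in a function and mapped over the organization ids. A sketch, assuming every id follows the same URL pattern as in the question (only 15310 is confirmed; add further ids as needed):

```r
library(rvest)
library(tidyverse)

# Scrape one organization page into a stacked (long) tibble
scrape_org <- function(id) {
  webpage <- read_html(
    paste0("https://tramites.economia.gob.cl/Organizacion/Details/", id))
  form_groups <- html_nodes(webpage, "div.form-group")
  tibble(
    id     = id,
    labels = form_groups %>% html_node("label, strong") %>% html_text(),
    values = form_groups %>%
      html_node(xpath = "br/following-sibling::text()|p") %>%
      html_text(trim = TRUE)
  )
}

ids <- c(15310)  # add the other organization ids here
df_all <- map_dfr(ids, scrape_org)  # one long table for all organizations
```

Because the result stays in long format, organizations with different numbers of variables stack cleanly; `pivot_wider` can still be applied at the end if a wide table is needed.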
Bas
  • Thanks! So this is basically my first method above but all done at once? How exactly were you able to do this? I mean, I see you extract a list first (`form_groups`), but then how can you apply the selection of names and content for each element in this list? Is it a combination of the pipe and the node/nodes differentiation? – luchonacho Dec 24 '19 at 12:48
  • No problem! The `form_groups` is a list of nodes (because I used `html_nodes`), each corresponding to a `div` with class `form-group` in the original web page. After that, I extract the relevant information for every element of this list. Since I only want one piece of information (either the label or the value) for every such `div`, I use `html_node`. By default, `html_node` takes a list of nodes and outputs a list of the same length (applying the css/xpath-selector on every element). The pipe is just a way to make the notation more convenient. – Bas Dec 24 '19 at 13:02
  • Ah, key then is that `html_node` takes a list as input. That does the trick. I went directly from `html_nodes` to `html_text`. Now, as I see, this still produces stacked data, whereas I was trying to get each variable as a new column, and then add rows for each company. However, stacked data might actually be way better, since that is how many packages work best (e.g. ggplot). So after all, your solution is the solution I am after. – luchonacho Dec 24 '19 at 14:06