
The page in question is this: https://tolltariffen.toll.no/tolltariff/headings/03.02?language=en (Click on OPEN ALL LEVELS to get the complete data)

I'm using RSelenium to load the page, then taking the page source and using rvest to capture the required fields: the description text that appears after clicking OPEN ALL LEVELS.


The code I've come up with so far splits some of the description data into multiple chunks, which is not useful for me.

    x <- remdr$getPageSource()
    xpg <- read_html(x[[1]])
    
    # get the HS descriptions
    treeView <- xpg %>%
      html_nodes(xpath = '//*/div[@class="MuiGrid-root MuiGrid-container MuiGrid-wrap-xs-nowrap"]') %>%
      html_nodes(xpath = '//*/p[contains(@class, "MuiTypography-body1")]') %>%
      html_nodes('span') %>%
      html_text(trim = TRUE)

I need all the descriptions in order as a list.

Update: This is the desired output format: the descriptions and the 8-digit code.

Frodo
  • Can you show a couple of items in the desired output format? I am not clear if e.g. all flat fish should be a single string at a given index within a list. – QHarr May 04 '22 at 04:25
  • @QHarr, edited my question. Let me know if it clears your doubt – Frodo May 04 '22 at 05:01

1 Answer


General thoughts:

RSelenium isn't strictly needed, and you can avoid the overhead of launching a browser. There is an API call, visible in the browser's network tab, which supplies the content of interest, and it can be called without any additional request configuration (e.g. headers).

The question of how to extract the items you want from the API response, in the format you want, then becomes a fun challenge (at least to me), as we do not know:

  1. how many levels of nesting there may be in this response (and possible future ones);
  2. whether the level of nesting can vary across listings within a given response for the items of interest;
  3. whether there will be a commodityCode at a given level (though the pattern appears to be that there is one at the deepest level of a given listing).

We also need to consider how to generate columns/lists of equal length for the output. These are just some starting considerations; below I discuss how I handled them.


The API call:

The browser's network tab shows a request to https://tolltariffen.toll.no/api/search/headings/03.02, which supplies the content of interest for heading 03.02.
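A minimal way to pull that response into R and grab the listings (this is the same endpoint and $headingItems key used in the full code further down):

    library(jsonlite)

    # Fetch and parse the JSON response for heading 03.02
    resp <- jsonlite::read_json("https://tolltariffen.toll.no/api/search/headings/03.02")

    # The listings of interest sit under the headingItems key
    data <- resp$headingItems
    length(data)  # number of top-level listings for this heading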


The API response:

This request returns nested JSON.

The content of interest is a list of named lists within the response, accessible via the parent key $headingItems. Each of these named lists is nested as per the levels on the webpage.

The accessor key headingItems repeats at each level of nesting; the outermost one is the parent list that is stored in data in the code to follow.

Below that, indicated by the level key, are the expanded entries you are after, nested within the response JSON.

Finally, the description entries contain the HTML for the descriptive text you are after, with English and Norwegian versions of the text.

In addition to this, there is, where present, a commodityCode key within the nested headingItems.
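A quick way to get a feel for that nesting (a sketch, using data from the snippet above; the field names match the regexes used in get_data below):

    # Look at the top two levels of the first listing
    str(data[[1]], max.level = 2)

    # English description HTML for the first listing (where present)
    data[[1]]$description$en

    # Flattening a listing turns the nesting into chained accessor names;
    # the paths of interest end in "description.en" or "commodityCode"
    head(names(unlist(data[[1]], recursive = TRUE)), 20)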


Approach and challenges:

Given that the commodityCode can sit at different levels and may not be present (unless it is assumed to always be present at the greatest depth of a given listing), and that it is unknown how many levels of headingItems there can be, the approach I chose was to use regex on the names of each flattened child named list to build a boolean mask (for our purposes, simply a logical vector): one mask for the English descriptions and one for the commodity codes. I process each child list separately, using purrr::map to apply a custom function that extracts the data as a data.table/data.frame.

The description/text mask is TRUE wherever the flattened name ends in description.en; the commodity-code mask is TRUE wherever the name ends in commodityCode. The TRUE values correspond to chained accessors whose chaining depends on depth, e.g. description.en at the top level and headingItems.description.en one level down.

Notice that some accessor paths are repeated. This means that I do not use the mask to retrieve the names and extract the associated values. Instead, I keep the TRUE and FALSE values and thereby have vectors of equal length. I combine the two logical vectors as columns within a data.table, along with the entire set of values from the child list (see the sketch below).
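A minimal sketch of building those two masks and the combined table for a single child list (mirroring what get_data does below):

    library(data.table)

    # Flatten one child list; names become the chained accessor paths
    y <- unlist(data[[1]], recursive = TRUE)

    # TRUE where the path is an English description (top level or nested)
    desc_mask <- grepl("(?:headingItems\\.)description\\.en$|^description\\.en$", names(y))

    # TRUE where the path ends in commodityCode
    code_mask <- grepl("commodityCode$", names(y))

    # Keep every value plus both masks, so all three columns have equal length
    dt <- data.table(text = y, description_header_flag = desc_mask, commodity_code = code_mask)

    # The accessor paths flagged as descriptions
    names(y)[desc_mask]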

This work is done within the custom function get_data, where I also then do the following steps:

  1. I filter for only the rows where there is a TRUE value, i.e. a value I wish to retrieve.

  2. I apply a function utilizing gsub(), to remove non-breaking whitespace, and read_html()/html_text(), to convert those descriptions which are actual HTML into plain text. N.B. Some entries are not actually HTML; these are handled by the if statement, and in those cases the input value is returned unchanged.

  3. At this point the codes and descriptions/text are in a single column, so I use the booleans in commodity_code to update that column's value where TRUE to match the text column, wrapped in an if so that FALSE is replaced with NA.

  4. Knowing that there is actually a one-row offset between a description and its associated code, where applicable, I then shift the commodity_code column values down one row to correctly align them with the descriptions (see the small illustration below).

  5. I then keep only the rows where description_header_flag is TRUE.

  6. Finally, I remove the now-unneeded flag column.

This leaves me with a clean data.table to return from the function.
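As a small illustration of the row shift in step 4 (toy values, not the real data):

    library(data.table)

    dt <- data.table(val = c("a", "b", "c"))

    # Each row takes the value from the row above; the first row becomes NA
    dt[, shifted := c(NA, val[.I - 1])]
    dt
    #    val shifted
    # 1:   a    <NA>
    # 2:   b       a
    # 3:   c       b

data.table::shift(val) would achieve the same one-row lag.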


Generating the final output:

As applying the custom function above with map() returns a list of data.tables, I then simply call rbindlist() to combine these into a single data.table:

    df <- rbindlist(map(data, get_data))

This can then be written to csv for example.

    fwrite(df, 'result.csv')



N.B. I return a data.table as you showed 2 columns in your desired output.


R:

    library(jsonlite)
    library(tidyverse)
    library(rvest)
    library(data.table)

    get_data <- function(x) {
      # Flatten the child list; names become the chained accessor paths
      y <- x %>% unlist(recursive = TRUE)

      # Flag English descriptions and commodity codes via regex on those names
      t <- data.table(
        text = y,
        description_header_flag = grepl("(?:headingItems\\.)description\\.en$|^description\\.en$", names(y)),
        commodity_code = grepl("commodityCode$", names(y))
      )

      # 1. Keep only the rows flagged as a description or a commodity code
      t <- t[description_header_flag | commodity_code, ]

      # 2. Strip non-breaking spaces and convert HTML descriptions to plain text;
      #    entries that are not HTML are returned unchanged
      t$text <- map2(t$text, t$description_header_flag, ~ gsub(intToUtf8(160), " ", if (.y & str_detect(.x, pattern = "<div>|<p>")) {
        html_text(read_html(.x))
      } else {
        .x
      }))

      # 3. Where the commodity_code flag is TRUE, take the code from the text column; otherwise NA
      t$commodity_code <- map2(t$commodity_code, t$text, ~ if (.x) {
        .y
      } else {
        NA
      })

      # 4. Shift the codes down one row so each aligns with its description
      t[, commodity_code := c(NA, commodity_code[.I - 1])]

      # 5. Keep only the description rows
      t <- t[description_header_flag == TRUE, ]

      # 6. Drop the flag column
      t[, description_header_flag := NULL]

      return(t)
    }

    data <- jsonlite::read_json("https://tolltariffen.toll.no/api/search/headings/03.02") %>% .$headingItems

    df <- rbindlist(map(data, get_data))

    fwrite(df, "result.csv")



Credits:

  1. gsub solution taken from: @shabbychef here

  2. row shift solution adapted from: @Gary Weissman here

QHarr
  • Hi, thanks for the detailed answer. I already have a solution to get the information from the API feed. But for a few headings (e.g. heading 0403), the order in which the items appear on the page and the way they are structured in the API feed do not match. When we get the data in the incorrect order, it affects the overall structure of the complete table, so getting the data from the API will not work for me. – Frodo May 31 '22 at 12:46
  • I have written the answer to produce output in your requested format, with the exception of using a dataframe with two columns as per your image. – QHarr May 31 '22 at 12:48
  • Yes. I have 1497 headings (0101, 0102, 0302 as in this question, up to 9999). I need to iterate the working method over all those headings. Because of the data inconsistency in the API, I'm looking for a way to achieve this using RSelenium. – Frodo May 31 '22 at 12:50
  • Your question only references a single page and is tagged with rvest as well as RSelenium. You should probably open a new question if the answer you really need should use selenium and needs to look at other pages. Also, might be worth mentioning the API unless you are sure your desired format/content cannot be achieved with the API. – QHarr May 31 '22 at 13:00
  • Yes you are right. I should have been clear in asking the question. New to SO. I have accepted your solution :) Thank you – Frodo Jun 03 '22 at 06:49
  • No worries. Just that if you still need more work done on this a new question is the best way as it is frowned upon to change the requirements of an existing question which has an answer. – QHarr Jun 03 '22 at 06:55