
The page in question is this: https://tolltariffen.toll.no/tolltariff/headings/03.02?language=en (Click on OPEN ALL LEVELS to get the complete data)

I'm using RSelenium to load the page, then taking the page source and using rvest to capture the required fields: the description text that appears after clicking OPEN ALL LEVELS.


The code I've come up with so far splits some of the description data into multiple chunks, which is not useful for me.

    x <- remdr$getPageSource()
    xpg <- read_html(x[[1]])
    
    # get the HS descriptions
    treeView <- xpg %>%
      html_nodes(xpath = '//*/div[@class="MuiGrid-root MuiGrid-container MuiGrid-wrap-xs-nowrap"]') %>%
      html_nodes(xpath = '//*/p[contains(@class, "MuiTypography-body1")]') %>%
      html_nodes('span') %>%
      html_text(trim = TRUE)

I need all the descriptions in order as a list.

Update: This is the desired output format: the descriptions and the 8-digit code.

Frodo
  • Can you show a couple of items in the desired output format? I am not clear if e.g. all flat fish should be a single string at a given index within a list. – QHarr May 04 '22 at 04:25
  • @QHarr, edited my question. Let me know if it clears your doubt – Frodo May 04 '22 at 05:01

1 Answer


General thoughts:

RSelenium isn't strictly needed, and you can avoid the overhead of launching a browser. There is an API call, visible in the browser's network tab, which supplies the content of interest, and it can be called without any additional request configuration (e.g. headers).

The question of how to extract the items you want from the API response, in the format you want, then becomes a fun challenge (at least to me), as we do not know:

  1. how many levels of nesting there may be in this response (and possible future ones);
  2. whether the level of nesting can vary across listings within a given response for the items of interest;
  3. whether there will be a commodityCode at a given level (though the pattern appears to be that there is one at the deepest level of a given listing).

We also need to consider how to generate columns/lists of equal length for the output. These are just some starting considerations; below I discuss how I handled them.


The API call:

The browser's network tab shows a request to https://tolltariffen.toll.no/api/search/headings/03.02, which supplies the content of interest for heading 03.02.
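A minimal way to pull that response into R and grab the listings (this is the same endpoint and $headingItems key used in the full code further down):

    library(jsonlite)

    # Fetch and parse the JSON response for heading 03.02
    resp <- jsonlite::read_json("https://tolltariffen.toll.no/api/search/headings/03.02")

    # The listings of interest sit under the headingItems key
    data <- resp$headingItems
    length(data)  # number of top-level listings for this heading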


The API response:

This request returns nested JSON.

The content of interest is a list of named lists within the response, accessible via the parent key $headingItems. Each of these named lists is nested as per the levels on the webpage.

The accessor key headingItems repeats at each level of nesting; the outermost one is the parent list that is stored in data in the code to follow.

Below that, indicated by the level key, are the expanded entries you are after, nested within the response JSON.

Finally, the description entries contain the HTML for the descriptive text you are after, with English and Norwegian versions of the text.

In addition to this, there is, where present, a commodityCode key within the nested headingItems.
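A quick way to get a feel for that nesting (a sketch, using data from the snippet above; the field names match the regexes used in get_data below):

    # Look at the top two levels of the first listing
    str(data[[1]], max.level = 2)

    # English description HTML for the first listing (where present)
    data[[1]]$description$en

    # Flattening a listing turns the nesting into chained accessor names;
    # the paths of interest end in "description.en" or "commodityCode"
    head(names(unlist(data[[1]], recursive = TRUE)), 20)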


Approach and challenges:

Given that the commodityCode can sit at different levels and may not be present (unless it is assumed to always be present at the greatest depth of a given listing), and that it is unknown how many levels of headingItems there can be, the approach I chose was to use regex on the names of each flattened child named list to build a boolean mask (for our purposes, simply a logical vector): one mask for the English descriptions and one for the commodity codes. I process each child list separately, using purrr::map to apply a custom function that extracts the data as a data.table/data.frame.

The description/text mask is TRUE wherever the flattened name ends in description.en; the commodity-code mask is TRUE wherever the name ends in commodityCode. The TRUE values correspond to chained accessors whose chaining depends on depth, e.g. description.en at the top level and headingItems.description.en one level down.

Notice that some accessor paths are repeated. This means that I do not use the mask to retrieve the names and extract the associated values. Instead, I keep the TRUE and FALSE values and thereby have vectors of equal length. I combine the two logical vectors as columns within a data.table, along with the entire set of values from the child list (see the sketch below).
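A minimal sketch of building those two masks and the combined table for a single child list (mirroring what get_data does below):

    library(data.table)

    # Flatten one child list; names become the chained accessor paths
    y <- unlist(data[[1]], recursive = TRUE)

    # TRUE where the path is an English description (top level or nested)
    desc_mask <- grepl("(?:headingItems\\.)description\\.en$|^description\\.en$", names(y))

    # TRUE where the path ends in commodityCode
    code_mask <- grepl("commodityCode$", names(y))

    # Keep every value plus both masks, so all three columns have equal length
    dt <- data.table(text = y, description_header_flag = desc_mask, commodity_code = code_mask)

    # The accessor paths flagged as descriptions
    names(y)[desc_mask]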

This work is done within the custom function get_data, where I also then do the following steps:

  1. I filter for only the rows where there is a TRUE value, i.e. a value I wish to retrieve.

  2. I apply a function utilizing gsub(), to remove non-breaking whitespace, and read_html()/html_text(), to convert those descriptions which are actual HTML into plain text. N.B. Some entries are not actually HTML; these are handled by the if statement, and in those cases the input value is returned unchanged.

  3. At this point the codes and descriptions/text are in a single column, so I use the booleans in commodity_code to update that column's value where TRUE to match the text column, wrapped in an if so that FALSE is replaced with NA.

  4. Knowing that there is actually a one-row offset between a description and its associated code, where applicable, I then shift the commodity_code column values down one row to correctly align them with the descriptions (see the small illustration below).

  5. I then keep only the rows where description_header_flag is TRUE.

  6. Finally, I remove the now-unneeded flag column.

This leaves me with a clean data.table to return from the function.
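As a small illustration of the row shift in step 4 (toy values, not the real data):

    library(data.table)

    dt <- data.table(val = c("a", "b", "c"))

    # Each row takes the value from the row above; the first row becomes NA
    dt[, shifted := c(NA, val[.I - 1])]
    dt
    #    val shifted
    # 1:   a    <NA>
    # 2:   b       a
    # 3:   c       b

data.table::shift(val) would achieve the same one-row lag.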


Generating the final output:

As applying the custom function above with map() returns a list of data.tables, I then simply call rbindlist() to combine these into a single data.table:

    df <- rbindlist(map(data, get_data))

This can then be written to csv for example.

    fwrite(df, 'result.csv')



N.B. I return a data.table as you showed 2 columns in your desired output.


R:

    library(jsonlite)
    library(tidyverse)
    library(rvest)
    library(data.table)

    get_data <- function(x) {
      # Flatten the child list; names become the chained accessor paths
      y <- x %>% unlist(recursive = TRUE)

      # Flag English descriptions and commodity codes via regex on those names
      t <- data.table(
        text = y,
        description_header_flag = grepl("(?:headingItems\\.)description\\.en$|^description\\.en$", names(y)),
        commodity_code = grepl("commodityCode$", names(y))
      )

      # 1. Keep only the rows flagged as a description or a commodity code
      t <- t[description_header_flag | commodity_code, ]

      # 2. Strip non-breaking spaces and convert HTML descriptions to plain text;
      #    entries that are not HTML are returned unchanged
      t$text <- map2(t$text, t$description_header_flag, ~ gsub(intToUtf8(160), " ", if (.y & str_detect(.x, pattern = "<div>|<p>")) {
        html_text(read_html(.x))
      } else {
        .x
      }))

      # 3. Where the commodity_code flag is TRUE, take the code from the text column; otherwise NA
      t$commodity_code <- map2(t$commodity_code, t$text, ~ if (.x) {
        .y
      } else {
        NA
      })

      # 4. Shift the codes down one row so each aligns with its description
      t[, commodity_code := c(NA, commodity_code[.I - 1])]

      # 5. Keep only the description rows
      t <- t[description_header_flag == TRUE, ]

      # 6. Drop the flag column
      t[, description_header_flag := NULL]

      return(t)
    }

    data <- jsonlite::read_json("https://tolltariffen.toll.no/api/search/headings/03.02") %>% .$headingItems

    df <- rbindlist(map(data, get_data))

    fwrite(df, "result.csv")



Credits:

  1. gsub solution taken from: @shabbychef here

  2. row shift solution adapted from: @Gary Weissman here

QHarr
  • Hi, thanks for the detailed answer. I already have a solution to get the information from the API feed. But for a few headings (e.g. heading 0403), the order in which the items appear on the page and the way they are structured in the API feed do not match. When we get the data in the incorrect order, it affects the overall structure of the complete table, so getting the data from the API will not work for me. – Frodo May 31 '22 at 12:46
  • I have written the answer to produce output in your requested format, with the exception of using a dataframe with two columns as per your image. – QHarr May 31 '22 at 12:48
  • Yes. I have 1497 headings (0101, 0102, 0302 as in this question, up to 9999). I need to iterate the working method over all those headings. Because of the data inconsistency in the API, I'm looking for a way to achieve this using RSelenium. – Frodo May 31 '22 at 12:50
  • Your question only references a single page and is tagged with rvest as well as RSelenium. You should probably open a new question if the answer you really need should use selenium and needs to look at other pages. Also, might be worth mentioning the API unless you are sure your desired format/content cannot be achieved with the API. – QHarr May 31 '22 at 13:00
  • Yes you are right. I should have been clear in asking the question. New to SO. I have accepted your solution :) Thank you – Frodo Jun 03 '22 at 06:49
  • No worries. Just that if you still need more work done on this a new question is the best way as it is frowned upon to change the requirements of an existing question which has an answer. – QHarr Jun 03 '22 at 06:55