There is a table of taxes by country at the link below that I would like to scrape into a dataframe with Country and Tax columns.

I've tried using the rvest package as follows to get my Country column but the list I generate is empty and I don't understand why.

I would appreciate any pointers on resolving this problem.

library(rvest)
d1 <- read_html(
  "http://taxsummaries.pwc.com/ID/Corporate-income-tax-(CIT)-rates"
  )
TaxCountry <- d1 %>%
  html_nodes('.countryNameQC') %>%
  html_text()
val

1 Answer

The data is loaded dynamically: the DOM is altered when JavaScript runs in the browser. rvest does not execute JavaScript, so it only sees the HTML as originally served.

The following selectors, in the browser, would have isolated your nodes:

.twoCountryWrapper .countryNameAndYearQC:nth-child(1) .countryNameQC
.twoCountryWrapper .countryNameAndYearQC:nth-child(1) .countryYear 
.twoCountryWrapper .countryNameAndYearQC:nth-child(2) .countryNameQC
.twoCountryWrapper .countryNameAndYearQC:nth-child(2) .countryYear

But those classes are not even present in the HTML that rvest returns.

The data of interest is actually stored in several nodes, all of which have ids with the common prefix dspQCLinks. The data inside looks as follows:

[screenshot not available]

So, you can gather all those nodes using a CSS attribute selector with the starts-with operator (^=):

html_nodes(page, "[id^=dspQCLinks]")
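To make the starts-with matching concrete, here is a minimal offline sketch using only the Python standard library. The sample markup and the ids dspQCLinks1 and dspQCLinks2 are made up for illustration (the real page's markup may differ), and the class PrefixIdText is a hypothetical helper, not part of rvest or BeautifulSoup. It assumes no void elements such as br appear inside the matched nodes.

```python
from html.parser import HTMLParser


class PrefixIdText(HTMLParser):
    """Collect the text inside elements whose id starts with a given prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.depth = 0    # greater than 0 while inside a matching element
        self.chunks = []  # text fragments found inside matching elements

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Nested tag inside a matching element
            self.depth += 1
        elif (dict(attrs).get("id") or "").startswith(self.prefix):
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


# Made-up markup: two nodes share the id prefix, one does not.
sample_html = (
    '<div id="dspQCLinks1">Albania@/link-a@15!,</div>'
    '<div id="other">ignored</div>'
    '<div id="dspQCLinks2">Exampleland@/link-b@20!,</div>'
)

parser = PrefixIdText("dspQCLinks")
parser.feed(sample_html)
collected = "".join(parser.chunks)
print(collected)
```

Only the text of the two prefixed nodes is collected, which is what the [id^=dspQCLinks] selector achieves in rvest or BeautifulSoup.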

Then extract the text and combine it into one string:

paste(html_text(html_nodes(page, "[id^=dspQCLinks]")), collapse = '')

Each row of the table is terminated by the two-character delimiter "!,", so we can split on that to generate the rows:

info = strsplit(paste(html_text(html_nodes(page, "[id^=dspQCLinks]")), collapse = ''),"!,")[[1]]
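The join-then-split step can be sketched offline in Python as follows. The chunks list is a stand-in for the text of each matched node; the second entry (Exampleland) is invented for illustration, and split_rows is a hypothetical helper name.

```python
# Stand-ins for the text of each matched node; the second row is made up.
chunks = [
    "Albania@/uk/taxsummaries/wwts.nsf/ID/Albania-Corporate-Taxes-on-corporate-income@15!,",
    "Exampleland@/hypothetical/link@20!,",
]


def split_rows(chunks, delimiter="!,"):
    """Join the node texts into one string, then split it into rows."""
    return "".join(chunks).split(delimiter)


rows = split_rows(chunks)

# Because every row ends with the delimiter, the last element is an empty string.
for row in rows:
    print(repr(row))
```

Joining before splitting means a row would still be recovered if it happened to straddle two nodes. The empty last element is why empty rows need to be filtered out later.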

An example row would then look like:

"Albania@/uk/taxsummaries/wwts.nsf/ID/Albania-Corporate-Taxes-on-corporate-income@15"

If we split each row on the @, the data we want is at indices 1 and 3 (here i is a single row from info):

arr <- strsplit(i, '@')[[1]]
country <- arr[1]
tax <- arr[3]

Thanks to @Brian's feedback, I have removed the loop I originally used to build the dataframe and replaced it with str_split_fixed. To quote @Brian, str_split_fixed(info, "@", 3) "gives you a character matrix, which can be directly coerced to a dataframe":

df <- data.frame(str_split_fixed(info, "@", 3))

After naming the columns (as in the full code below), you then remove the empty rows at the bottom of df:

df <- df[df$Country != "",]
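For readers following along in Python, the fixed-width split and the empty-row filter can be approximated with plain lists. This is only a sketch: split_fixed is a hypothetical helper that mimics the padding behaviour of str_split_fixed, not a function from any library, and the rows are sample strings.

```python
def split_fixed(text, sep, n):
    """Split text into exactly n pieces, padding with '' when there are fewer."""
    pieces = text.split(sep, n - 1)  # at most n pieces; any remainder stays in the last
    return pieces + [""] * (n - len(pieces))


rows = [
    "Albania@/uk/taxsummaries/wwts.nsf/ID/Albania-Corporate-Taxes-on-corporate-income@15",
    "",  # an empty trailing row, as left behind by splitting on the delimiter
]

# One list of three fields per row: country, link, tax
matrix = [split_fixed(row, "@", 3) for row in rows]

# Keep country and tax, and drop rows whose country field is empty
table = [(country, tax) for country, _link, tax in matrix if country != ""]
print(table)
```

Because every row is padded to three fields, the empty row becomes three empty strings rather than raising an index error, and a single comparison on the country field removes it.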

Sample of df:

[screenshot not available]


R:

library(rvest)
library(stringr)
library(magrittr)

page <- read_html('http://taxsummaries.pwc.com/ID/Corporate-income-tax-(CIT)-rates')
# Join the text of every node whose id starts with dspQCLinks, then split into rows on "!,"
info <- strsplit(paste(html_text(html_nodes(page, "[id^=dspQCLinks]")), collapse = ''), "!,")[[1]]
# Split each row into exactly three fields: country, link, tax
df <- data.frame(str_split_fixed(info, "@", 3))
colnames(df) <- c("Country","Link","Tax")
df <- subset(df, select = c("Country","Tax"))
# Drop the empty rows at the bottom
df <- df[df$Country != "",]
View(df)

Python:

I did this first in Python, as it was quicker for me:

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

r = requests.get('http://taxsummaries.pwc.com/ID/Corporate-income-tax-(CIT)-rates')
soup = bs(r.content, 'lxml')
text = ''

# Concatenate the text of every node whose id starts with dspQCLinks
for i in soup.select('[id^=dspQCLinks]'):
    text += i.text

rows = text.split('!,')
countries = []
tax_info = []

for row in rows:
    if row:  # skip the empty strings left by the trailing delimiter
        items = row.split('@')
        countries.append(items[0])
        tax_info.append(items[2])

df = pd.DataFrame(list(zip(countries, tax_info)), columns=['Country', 'Tax'])
print(df)

Reading:

  1. str_split_fixed
QHarr
  • I would welcome suggestions on improving the above with a form of _apply_. – QHarr Jul 29 '19 at 21:20
  • You don't need `apply` or a loop. Given the vector of strings, `str_split_fixed(info, "@", 3)` gives you a character matrix, which can be directly coerced to a dataframe and any undesirable rows filtered out. – Brian Jul 30 '19 at 01:17
  • Thanks @Brian. I have updated my answer along the lines of what I think you meant. I note that I'm left with two "empty" rows at the bottom, so I have used apply to remove those, but the answer looks so much cleaner, and I suspect it is more efficient. Really appreciate it. – QHarr Jul 30 '19 at 02:27
  • @QHarr and Brian: Thank you for the detailed explanations! – val Jul 30 '19 at 03:00
  • @QHarr, you don't need `apply` there either. `df[df$Country != "",]` is the vectorized form. – Brian Jul 30 '19 at 12:52
  • Many thanks @Brian. So I can simply say `df <- df[df$Country != "",]`. Nice! You have been really helpful. – QHarr Jul 30 '19 at 12:53