webscraping using rvest package comes out empty

Question

There is a table of taxes by country at the link below that I would like to scrape into a dataframe with Country and Tax columns.

I've tried using the rvest package as follows to get my Country column but the list I generate is empty and I don't understand why.

I would appreciate any pointers on resolving this problem.

library(rvest)
d1 <- read_html(
  "http://taxsummaries.pwc.com/ID/Corporate-income-tax-(CIT)-rates"
  )
TaxCountry <- d1 %>%
  html_nodes('.countryNameQC') %>%
  html_text()

How many columns do you expect in the output? There are 4 _columns_ in the source — QHarr, Jul 29 '19 at 16:39

QHarr · Accepted Answer · 2019-07-30T12:56:21.493

The data is dynamically loaded, and the DOM is altered when JavaScript runs in the browser. This doesn't happen with rvest, which does not execute JavaScript and only sees the HTML as the server sends it.

The following selectors, in the browser, would have isolated your nodes:

.twoCountryWrapper .countryNameAndYearQC:nth-child(1) .countryNameQC
.twoCountryWrapper .countryNameAndYearQC:nth-child(1) .countryYear 
.twoCountryWrapper .countryNameAndYearQC:nth-child(2) .countryNameQC
.twoCountryWrapper .countryNameAndYearQC:nth-child(2) .countryYear

But those classes are not even present in the HTML that rvest returns.
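A quick way to confirm this kind of problem is to check whether the class name appears anywhere in the raw response before writing selectors. Here is a minimal Python sketch using only the standard library. Note that has_class is a hypothetical helper, and both HTML strings are made-up stand-ins (tag names and the id suffix are assumptions), not the real page markup.

```python
import re

def has_class(raw_html, class_name):
    """Return True if any class attribute in raw_html lists class_name."""
    for match in re.finditer(r'class\s*=\s*["\']([^"\']*)["\']', raw_html):
        if class_name in match.group(1).split():
            return True
    return False

# Assumed stand-in for what a client that does not run JavaScript receives
static_html = (
    '<div id="dspQCLinks1">Albania@/uk/taxsummaries/wwts.nsf/ID/'
    'Albania-Corporate-Taxes-on-corporate-income@15!,</div>'
)
# Assumed stand-in for the DOM after the browser has run the page's scripts
rendered_html = '<span class="countryNameQC">Albania</span>'

in_static = has_class(static_html, "countryNameQC")
in_rendered = has_class(rendered_html, "countryNameQC")
print(in_static, in_rendered)
```

If the class is missing from the raw response, no selector based on it can match, whatever scraping library you use.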

The data of interest is actually stored in several nodes, all of which have ids with a common prefix of dspQCLinks. Inside, each node holds rows of text delimited by "!,", with the fields of each row separated by "@".

So, you can gather all those nodes using the CSS attribute = value selector with the starts-with operator (^):

html_nodes(page, "[id^=dspQCLinks]")
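To illustrate what the starts-with selector matches, here is a rough Python sketch that does the same filtering by hand with the standard library's HTMLParser. PrefixIdCollector and the sample HTML are hypothetical (the div tags, the numeric id suffixes and the second row are assumptions); in practice you would just use the selector.

```python
from html.parser import HTMLParser

class PrefixIdCollector(HTMLParser):
    """Collect the text of elements whose id starts with a given prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.depth = 0    # > 0 while inside a matching element
        self.texts = []   # one entry per matching element

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Nested tag inside a match (unclosed void tags like <br> are not handled)
            self.depth += 1
        elif (dict(attrs).get("id") or "").startswith(self.prefix):
            self.depth = 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data

# Made-up markup: only the Albania row comes from the page itself
SAMPLE_HTML = """
<div id="dspQCLinks1">Albania@/uk/taxsummaries/wwts.nsf/ID/Albania-Corporate-Taxes-on-corporate-income@15!,</div>
<div id="other">ignore me</div>
<div id="dspQCLinks2">Exampleland@/hypothetical/link@20!,</div>
"""

collector = PrefixIdCollector("dspQCLinks")
collector.feed(SAMPLE_HTML)
texts = collector.texts
print(texts)
```

Only the two elements whose id begins with dspQCLinks contribute text; the element with id "other" is skipped.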

Then extract the text and combine it into one string:

paste(html_text(html_nodes(page, "[id^=dspQCLinks]")), collapse = '')

Now each row in your table is delimited by "!,", so we can split on that to generate the rows:

info = strsplit(paste(html_text(html_nodes(page, "[id^=dspQCLinks]")), collapse = ''),"!,")[[1]]
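The same join-then-split step can be sketched in Python on a couple of made-up text chunks (only the Albania row is taken from the page; the other rows and the way they are chunked are assumptions). It also shows where empty rows can come from: a trailing delimiter leaves an empty string at the end.

```python
# Hypothetical text chunks, one per matched node
chunks = [
    "Albania@/uk/taxsummaries/wwts.nsf/ID/Albania-Corporate-Taxes-on-corporate-income@15!,"
    "Exampleland@/hypothetical/link-a@20!,",
    "Samplestan@/hypothetical/link-b@25!,",
]

# Combine into one string, then split into rows on the "!," delimiter
combined = "".join(chunks)
rows = combined.split("!,")

for row in rows:
    print(repr(row))
```

The last element of rows is an empty string, so it needs filtering out later.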

An example row would then look like:

"Albania@/uk/taxsummaries/wwts.nsf/ID/Albania-Corporate-Taxes-on-corporate-income@15"

If we split each row on the @, the data we want is at indices 1 and 3:

arr <- strsplit(info[1], '@')[[1]]  # info[1] is the first row
country <- arr[1]
tax <- arr[3]

Thanks to @Brian's feedback, I have removed the loop I had used to build the dataframe and replaced it with, to quote @Brian, str_split_fixed(info, "@", 3) [which] gives you a character matrix, which can be directly coerced to a dataframe.

df <- data.frame(str_split_fixed(info, "@", 3))

You then name the columns (as in the full R code below) and remove the empty rows at the bottom of the df:

df <- df[df$Country != "",]
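For comparison, the fixed-width split and the empty-row filter can be sketched in plain Python. split_fixed is a hypothetical helper that imitates what str_split_fixed does for n pieces (padding with empty strings); the second row is made up.

```python
def split_fixed(text, sep, n):
    """Split text into exactly n pieces, padding with "" when there are fewer."""
    parts = text.split(sep, n - 1)
    return parts + [""] * (n - len(parts))

info = [
    "Albania@/uk/taxsummaries/wwts.nsf/ID/Albania-Corporate-Taxes-on-corporate-income@15",
    "Exampleland@/hypothetical/link@20",  # made-up row
    "",                                   # empty row left over from the split
]

# Every row becomes [country, link, tax], even the empty one
matrix = [split_fixed(row, "@", 3) for row in info]

# Keep only Country and Tax, dropping rows with an empty country
table = [(country, tax) for country, _link, tax in matrix if country != ""]
print(table)
```

Because every row has exactly three pieces, the filter can be a single condition on the first column, which is what the vectorized df[df$Country != "",] does in R.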

Sample of df:

R:

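library(rvest)
library(stringr)
library(magrittr)

# The raw HTML (no JavaScript executed) holds the data in nodes
# whose id starts with dspQCLinks
page <- read_html('http://taxsummaries.pwc.com/ID/Corporate-income-tax-(CIT)-rates')
nodes <- html_nodes(page, "[id^=dspQCLinks]")

# Join the node text into one string, then split it into rows on "!,"
info <- strsplit(paste(html_text(nodes), collapse = ''), "!,")[[1]]

# Split each row on "@" into exactly three pieces: country, link, tax
df <- data.frame(str_split_fixed(info, "@", 3))
colnames(df) <- c("Country","Link","Tax")
df <- subset(df, select = c("Country","Tax"))
df <- df[df$Country != "",]  # drop the empty rows at the bottom
View(df)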

Python:

I did this first in Python as it was quicker for me:

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

r = requests.get('http://taxsummaries.pwc.com/ID/Corporate-income-tax-(CIT)-rates')
soup = bs(r.content, 'lxml')

# Concatenate the text of every node whose id starts with dspQCLinks
text = ''.join(node.text for node in soup.select('[id^=dspQCLinks]'))

countries = []
tax_info = []

# Rows are delimited by "!,"; fields within a row are delimited by "@"
for row in text.split('!,'):
    if row:  # skip empty strings left by the split
        items = row.split('@')
        countries.append(items[0])
        tax_info.append(items[2])

df = pd.DataFrame(list(zip(countries, tax_info)), columns=['Country', 'Tax'])
print(df)

Reading:

str_split_fixed

I would welcome suggestions on improving the above with a form of _apply_ . — QHarr, Jul 29 '19 at 21:20
You don't need `apply` or a loop. Given the vector of strings, `str_split_fixed(info, "@", 3)` gives you a character matrix, which can be directly coerced to a dataframe and any undesirable rows filtered out. — Brian, Jul 30 '19 at 01:17
Thanks @Brian. I have updated my answer along the lines of what I think you meant. I note that I'm left with two "empty" rows at the bottom, so I have used apply to remove those, but the answer looks so much cleaner, and I suspect it is more efficient. Really appreciate it. — QHarr, Jul 30 '19 at 02:27
@QHarr, you don't need `apply` there either. `df[df$Country != "",]` is the vectorized form. — Brian, Jul 30 '19 at 12:52
Many thanks @Brian. So I can simply say `df <- df[df$Country != "",]`. Nice! You have been really helpful. — QHarr, Jul 30 '19 at 12:53
