I am fairly new to R and am having trouble with pulling data from the Forbes website.

My current function is:

library(XML)  # provides readHTMLTable()

url <- "http://www.forbes.com/global2000/list/#page:1_sort:0_direction:asc_search:_filter:All%20industries_filter:All%20countries_filter:All%20states"
data <- readHTMLTable(url)

However, when I change the page # in the url from 1 to 2 (or to any other number), the data that is pulled is the same data from page 1. For some reason R does not pull the data from the correct page. If you manually paste the link into the browser with a specific page #, then it works fine.

Does anyone have an idea as to why this is happening?

Thanks!

  • The data is being loaded via JavaScript and is not in the actual HTML of the page being sent from the server. If you need a scraping method that can run JavaScript, try the RSelenium package. – MrFlick Feb 11 '15 at 21:46
  • Great. I will try the RSelenium package. Thanks! – Chintan Desai Feb 12 '15 at 18:23

2 Answers

This appears to be caused by the URL fragment, which is everything after the pound sign (#). A fragment is handled by the browser, not the server: it normally tells the browser to jump to an anchor on the page, and it is not included in the HTTP request itself.

readHTMLTable() therefore most likely requests the same base page no matter which page number appears after the #, which would explain why you always get page 1. See if you can find a version of the same table that does not require # in the URL.
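
To make the point about fragments concrete, here is a minimal sketch in base R. The helper strip_fragment() is hypothetical (it is not part of any package), and the fragments are shortened versions of the ones in the question. It only illustrates which part of the URL would reach the server; it does not fetch anything.

```r
# Hypothetical helper: drop everything from the first "#" onward,
# leaving only the part of the URL that is sent in the HTTP request.
strip_fragment <- function(url) sub("#.*$", "", url)

base_url <- "http://www.forbes.com/global2000/list/"

# Shortened fragments, for illustration only.
page1 <- paste0(base_url, "#page:1_sort:0_direction:asc")
page2 <- paste0(base_url, "#page:2_sort:0_direction:asc")

requested1 <- strip_fragment(page1)
requested2 <- strip_fragment(page2)

# Both URLs reduce to the same request, so the server returns the same HTML.
same_request <- identical(requested1, requested2)
print(same_request)
```

Because both page URLs collapse to the same request, changing the page number after the # cannot change what a non-browser client downloads.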

Here are some helpful links that might shed light on what you are experiencing: What is it when a link has a pound "#" sign in it

https://support.microsoft.com/kb/202261/en-us

If I come across anything else that's helpful, I'll share it in follow-up comments.

Community
JFu

What you might need to do is use the URLencode() function in R.

kdb.url <- "http://m1:5000/q.csv?select from data0 where folio0 = `KF"
kdb.url <- URLencode(kdb.url)
df <- read.csv(kdb.url, header=TRUE)

You might have meta-characters in your URL too. (Mine are the spaces and the backtick.)

> kdb.url
[1] "http://m1:5000/q.csv?select%20from%20data0%20where%20folio0%20=%20%60KF"
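
As a small sanity check on the encoding step, the sketch below round-trips the same string through URLencode() and URLdecode() from base R's utils package. The variable names and the "a b&c" sample value are only illustrative. It also shows the reserved = TRUE argument, which additionally escapes reserved characters such as & and is usually what you want for a single query value rather than a whole URL.

```r
raw_url <- "http://m1:5000/q.csv?select from data0 where folio0 = `KF"

# Default behaviour: spaces and the backtick are escaped,
# while reserved characters such as ":", "/", "?" and "=" are left alone.
encoded <- URLencode(raw_url)

# Decoding should give back the original string.
decoded <- URLdecode(encoded)
round_trip_ok <- identical(decoded, raw_url)

# For one query value (not a full URL), also escape reserved characters.
encoded_value <- URLencode("a b&c", reserved = TRUE)

print(encoded)
print(round_trip_ok)
print(encoded_value)
```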

They think of everything, those R guys.

Bonaparte