
I'm trying to connect to an online database through R, which can be found here: http://open.canada.ca/data/en/dataset/2270e3a4-447c-45f6-8e63-aea9fe94948f

How can I load a data table into R in a way that lets me simply change the table name in my code to access other tables? I'm not particularly concerned with which format I need to use (JSON, JSON-LD, XML).

Thanks in advance!

Duke

2 Answers


Assuming you know the URLs for each of the datasets, a similar question can be found here:

Download a file from HTTPS using download.file()

For this it becomes:

library(RCurl)

# CSV export URL for one StatCan table
URL <- "http://www.statcan.gc.ca/cgi-bin/sum-som/fl/cstsaveascsv.cgi?filename=labr71a-eng.htm&lan=eng"
x <- getURL(URL)                                         # fetch the raw CSV as a string
URLout <- read.csv(textConnection(x), row.names = NULL)  # parse it into a data frame

I obtained the URL by right-clicking the access button and copying the address.

I had to declare row.names=NULL because the number of columns in the first row is not equal to the number of columns elsewhere, so read.csv assumes the first column contains row names, as described here. I'm not sure whether the URLs to these datasets change when they are updated, but this isn't a really convenient way to get the data. The JSON doesn't seem much better for intuitively switching between datasets.
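For what it's worth, here is a toy illustration (made-up data, not the StatCan file) of that read.csv behaviour when the header row has one field fewer than the data rows:

txt <- "a,b\n1,2,3\n4,5,6"   # header has two fields, the data rows have three

read.csv(text = txt)                     # first column is promoted to row names
read.csv(text = txt, row.names = NULL)   # rows are simply numbered; all fields stay as data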

At least this way you could create a list of URLs and perform the following:

URL <- list(getURL("http://www.statcan.gc.ca/cgi-bin/sum-som/fl/cstsaveascsv.cgi?filename=labr71a-eng.htm&lan=eng"),
            getURL("http://www.statcan.gc.ca/cgi-bin/sum-som/fl/cstsaveascsv.cgi?filename=labr72-eng.htm&lan=eng"))

# parse each downloaded CSV string into a data frame
URLout <- lapply(URL, function(x) read.csv(textConnection(x), row.names = NULL, skip = 2))

Again, I don't like having to declare row.names=NULL, and when I look at the file I don't see the discrepant number of columns; however, this will at least get the files into the R environment for you. It may take some more work to perform the operation over multiple URLs; one way that could look is sketched below.
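For example, one possible sketch: keep a named vector of URLs (the labels below are ones I made up from the file names, not official StatCan table names) and read them into a named list, so a table can then be pulled out by its label.

library(RCurl)

urls <- c(
  labr71a = "http://www.statcan.gc.ca/cgi-bin/sum-som/fl/cstsaveascsv.cgi?filename=labr71a-eng.htm&lan=eng",
  labr72  = "http://www.statcan.gc.ca/cgi-bin/sum-som/fl/cstsaveascsv.cgi?filename=labr72-eng.htm&lan=eng"
)

# lapply keeps the names, so the result is a named list of data frames
tables <- lapply(urls, function(u) read.csv(textConnection(getURL(u)), row.names = NULL, skip = 2))

head(tables[["labr71a"]])   # access a table by its label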

In a further effort to obtain useful colnames:

URL <- "http://www.statcan.gc.ca/cgi-bin/sum-som/fl/cstsaveascsv.cgi?filename=labr71a-eng.htm&lan=eng"
x <- getURL(URL)
URLout <- read.csv(textConnection(x), row.names = NULL, skip = 2)

The argument skip = 2 will skip the first 2 rows when reading in the CSV and will yield some header names. Because the headers are numbers, an X is placed in front of each. Row 2 in this case has the value "number" in the second column. Unfortunately it appears this data was intended for use within Excel, which is really sad.
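If the X prefixes get in the way, one possible cleanup (continuing from the snippet above, so x and URLout are assumed to already exist) is to strip them afterwards, or to re-read with check.names = FALSE and keep the original headers:

# remove the leading "X" that read.csv adds to purely numeric header names
names(URLout) <- sub("^X(?=[0-9])", "", names(URLout), perl = TRUE)

# or keep the headers exactly as they appear in the file (they may not be
# syntactically valid R names, so access them with backticks or [[ ]])
URLout2 <- read.csv(textConnection(x), row.names = NULL, skip = 2, check.names = FALSE)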

Badger
  • It does import the table, but it looks like there would be no easy way to import data by being able to pick a table name, since it's not in the URL. Thanks! – Duke Jul 28 '17 at 13:39
  • Yeah, kind of a pain; I was hoping for that too. Further, they have really poor names for their CSVs (labr71a-eng). I'd be curious whether they have some kind of naming-convention documentation that might help you sort out what lives where. I checked the metadata files there too; they didn't really yield a lot of information. – Badger Jul 28 '17 at 15:03

1) You need to download the CSV into some directory that you have access to.

2) Use read.csv, read_csv (readr), or fread (data.table) to read that CSV file into R.

yourTableName<-read.csv("C:/..../canadaDataset.csv")

3) You can assign that CSV to whatever object name you want.
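A minimal sketch of those three steps together, in case it helps (the URL is the one used in the other answer; the local path is just a temporary-directory placeholder):

url  <- "http://www.statcan.gc.ca/cgi-bin/sum-som/fl/cstsaveascsv.cgi?filename=labr71a-eng.htm&lan=eng"
dest <- file.path(tempdir(), "canadaDataset.csv")

download.file(url, dest)                                      # 1) download the CSV locally
yourTableName <- read.csv(dest, row.names = NULL, skip = 2)   # 2) read it into R
# 3) the object name on the left ("yourTableName") is entirely up to you

# data.table::fread is a faster alternative:
# yourTableName <- data.table::fread(dest, skip = 2)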

skhan8
  • Yes, but then that would mean I have to download the tables individually. I'm looking for a way to create a list of tables I want, and then execute the script so I can pull all the tables at once directly from the site. Also, if the data changes I'd have to re-pull all the tables... which wouldn't be convenient – Duke Jul 27 '17 at 21:00
  • Since that's a web form (?cgi), I'm not entirely sure how you can pull all the tables at once from the site. If the file was located on the site "statcan.gc.ca/dataset.csv", then that's an entirely different story... – skhan8 Jul 27 '17 at 21:03