
I am creating my own R package and would like to know which methods I can use to add (time-series) datasets to it. Here are the specifics:

I have created a package subdirectory called data and I am aware that this is the location where I should save the datasets that I want to add to my package. I am also cognizant of the fact that the files containing the data may be .rda, .txt, or .csv files.

Each series of data that I want to add to the package consists of a single column of numbers (e.g. of the form 340 or 4.5), and the series differ in length.

So far, I have saved all of the datasets into a .txt file. I have also successfully loaded the data using the data() function. Problem not solved, however.

The problem is that each series of data loads as a factor except for the series greatest in length. The series that load as factors contain missing values (of the form '.'). I had to add these missing values in order to make each column of data the same in length. I tried saving the data as unequal columns, but I received an error message after calling data().

A consequence of adding missing values to get the data to load is that once the data is loaded, I need to remove the NA's in order to get on with my analysis of the data! So, this clearly is not a good way of doing things.

Ideally (I suppose), I would like the data to load as numeric vectors or as a list. In this way, I wouldn't need the NA's appended to the end of each series.

How do I solve this problem? Should I save all of the data into one single file? If so, in what format should I do it? Or should I save the datasets into a number of files? Again, in which format? What is the best practical way of doing this? Any tips would be greatly appreciated.

epo3
Graeme Walsh

4 Answers


I'm not sure I understood your question correctly, but if you get your data into the form you want in R and save it with

save(myediteddata, file="data.rda")

The data should be loaded exactly the way you saw it in R.
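
A minimal sketch of this approach for series of unequal length. The series names (gdp, cpi) and their values are made up for illustration, and a temporary directory stands in for the package's data directory. A named list holds numeric vectors of different lengths, so no padding with missing values is needed:

```r
# Hypothetical series of different lengths; names and values are illustrative only.
my_series <- list(
  gdp = c(340, 355, 362, 371),
  cpi = c(4.5, 4.7)
)

# A temporary directory stands in for the package's data/ folder.
data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE)
rda_path <- file.path(data_dir, "my_series.rda")

# save() stores the object under its R name, so load() restores `my_series`.
save(my_series, file = rda_path)

# Load into a fresh environment to check that the object round-trips unchanged.
env <- new.env()
load(rda_path, envir = env)
restored <- env$my_series

str(restored)
```

Saving each series as its own numeric vector in a separate .rda file would work the same way; the list is just one way to keep related series together.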

To make the datasets in the data directory available without an explicit call to data(), add

LazyData: true

to the DESCRIPTION file of your package.

If this doesn't help, you could post one of your files and a printout of the format you want; that will help us to help you ;)

user1265067
  • Thanks, user1265067. Your suggestion has helped me a lot. In the end, I decided to save each series, in my preferred format, as separate .rda files. This method works a charm for me. Now I can move on to creating .rd files and putting them into the man subdirectory in order to describe the datasets in my package. Cheers! By the way, apologies for not making my question easy to understand - it was a difficult problem to put into words. – Graeme Walsh May 13 '13 at 00:53
  • @GraemeWalsh: Can you explain how you used these .rda files in your code? Is it possible to use .rds files? – Ankit Sep 08 '13 at 19:45
  • @Ankit Use the load() function to load the data into the workspace: http://en.wikibooks.org/wiki/R_Programming/Working_with_data_frames#Reading_and_saving_data Does this help? – Graeme Walsh Sep 08 '13 at 23:32

In addition to saving as rda files you could also choose to load them as numeric with:

 read.table( ... , colClasses="numeric")

Or as non-factor-text:

 read.table( ..., as.is=TRUE) # which does pretty much the same as stringsAsFactors=FALSE
 read.table( ..., colClasses="character")

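Note that data() itself does not pass these arguments along: it is documented to read text files with a fixed read.table(..., header=TRUE) call, so these options apply when you read the file yourself, for example before saving the result as an .rda file.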
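
A minimal sketch of reading a padded file directly with read.table(). The column names and values are made up, and na.strings = "." is an assumption based on the question's description of the missing values; without it, colClasses = "numeric" would fail on the '.' entries:

```r
# Illustrative stand-in for the padded .txt file described in the question.
raw_text <- paste(
  c("a b",
    "340 4.5",
    "355 4.7",
    "362 .",
    "371 ."),
  collapse = "\n"
)

# na.strings = "." turns the padding into NA, so every column parses as numeric.
padded <- read.table(text = raw_text, header = TRUE,
                     colClasses = "numeric", na.strings = ".")

# Drop the padding to recover series of unequal length as a named list.
series <- lapply(padded, function(x) x[!is.na(x)])

str(series)
```

The resulting list could then be stored with save(), which avoids re-parsing the text file each time the data is needed.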

IRTFM

The preferred location for your data depends on its format and on who needs to use it.

As Hadley suggested:

  • If you want to store binary data and make it available to the user, put it in data/. This is the best place to put example datasets.
  • If you want to store parsed data, but not make it available to the user, put it in R/sysdata.rda. This is the best place to put data that your functions need.
  • If you want to store raw data, put it in inst/extdata.

I suggest you have a look at the linked chapter as it goes into detail about working with data when developing R packages.
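
As a rough sketch, the three locations could be populated like this. A temporary directory stands in for the package root, and the object names, file names, and the package name mypkg are hypothetical:

```r
# A temporary directory stands in for the package root.
pkg_root <- file.path(tempdir(), "mypkg")
for (d in c("data", "R", file.path("inst", "extdata"))) {
  dir.create(file.path(pkg_root, d), recursive = TRUE, showWarnings = FALSE)
}

example_series <- c(340, 355, 362)       # hypothetical example dataset for users
lookup_table   <- c(low = 1, high = 10)  # hypothetical data needed by package functions

# data/: binary data made available to the user.
save(example_series, file = file.path(pkg_root, "data", "example_series.rda"))

# R/sysdata.rda: parsed data for internal use only.
save(lookup_table, file = file.path(pkg_root, "R", "sysdata.rda"))

# inst/extdata: raw files shipped as they are.
write.csv(data.frame(value = example_series),
          file.path(pkg_root, "inst", "extdata", "raw_series.csv"),
          row.names = FALSE)

list.files(pkg_root, recursive = TRUE)
```

Once such a package is installed, a raw file would typically be located with system.file("extdata", "raw_series.csv", package = "mypkg").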

epo3

You'll need to create the data file and include it in the R package, and you may want to also document it. Here's how to do both.

Create the data file and include it in R package

  • Create a directory called data/ inside the package and place the data files in it. Use only .rda and .RData files.
  • When creating the rda/RData file from an R object, make sure the R object is named what you want it to be named when it's used in the package and use save() to create it. Example:
save(river_fish, file = "data/river_fish.rda", version = 2)
  • Add this on a new line in the file called DESCRIPTION:
LazyData: true

Documenting the dataset

Document the dataset by placing a string with the dataset name after the documentation:

#' This is data to be included in my package
#'
#' @author My Name \email{blahblah@@roxygen.org}
#' @references \url{data_blah.com}
"data-name"

The dplyr source code contains some nice examples of datasets documented this way.


Notes

  • To access the data in the package, run river_fish or whatever the name of the dataset is. Nothing more is needed.

  • Using version = 2 when calling save() keeps the data object readable by older R versions (prior to 3.5.0) and prevents this warning:

WARNING: Added dependency on R >= 3.5.0 because serialized objects in serialize/load version 3 cannot be read in older versions of R.

  • There is no need to use load() inside the R package; referring to the object directly (e.g. river_fish) is enough to get the data from data/river_fish.rda. If you do want to load an rda/RData file yourself (e.g. when experimenting or testing), this will do it:
load("data/river_fish.rda")
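
A small sketch of the save/load round trip outside a package. The values in river_fish are made up, and a temporary file stands in for data/river_fish.rda:

```r
# Hypothetical dataset; a temporary file stands in for data/river_fish.rda.
river_fish <- c(12, 7, 30)
rda_path <- file.path(tempdir(), "river_fish.rda")

# version = 2 keeps the file readable by R versions older than 3.5.0.
save(river_fish, file = rda_path, version = 2)

# Remove the original so that the object can only come from the file.
rm(river_fish)

# load() restores objects under their saved names and returns those names invisibly.
loaded_names <- load(rda_path)
print(loaded_names)
print(river_fish)
```

Capturing the return value of load() is a convenient way to find out which objects an unfamiliar .rda file contains.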
stevec