46

What options does R have for reading files in the native SAS format, sas7bdat, into R?

The NCES Common Core, for example, contains an extensive repository of data files saved in this format. For concreteness, let's focus on trying to read in this file from LEA Universe in 1997-98, which contains education-agency-level demographics for entities in all states beginning A through I.

Here's a preview from SAS of the data:

A screenshot from SAS showing 15 rows of data and the first 5 columns

What's the simplest way to bring this data in to my R environment? I don't have any version of SAS available and am not willing to pay, so simply converting it to .csv would be a hassle.

MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
  • sure, I've used several over a few years, and they always seem finicky or lack some feature I need and there are several to choose from (and more keep popping up), so it would be nice to have a wiki addressing such a common question. I have all but given up on "read sas data set" packages--I never figured out how to get `sas7bdat` to apply formats, I just tried `haven` again and it gives me an error. If I must, I use a wrapper for `Hmisc::sas.get` to read a directory of sas data sets and return a list of data frames which, although it requires a working sas, has always worked for me – rawr Mar 22 '16 at 15:17
  • @rawr if find the time and the file's public, please add an answer exemplifying the shortfalls of `haven` :) In such cases, though `Hmisc` requires a working SAS, knowing the alternatives is helpful. – MichaelChirico Mar 22 '16 at 15:28
  • `haven` reads the files just fine. I need the formats also because the mountains of data I get from sas are largely unformatted. when `haven` doesnt throw me a vague error, it doesn't really apply the formats--only keeps them as attributes requiring a [little more user legwork](http://stackoverflow.com/questions/33421854/how-can-i-import-sas-format-files-into-r/33425629#33425629)--not much, not difficult but room for errors. `Hmisc::sas.get` (and the wrapper fn I use) do all this in sas (optionally) and return the formatted data frame – rawr Mar 24 '16 at 17:29
  • 1
    another problem I run into with these is if you are trying to use a catalog which was created on unix/linux on windows (or vice versa), you get more errors. however, if you have the proc format code, you can use sas to create the format catalog native to your platform. since haven requires the catalog, if you only have the proc format code, you seem to be SOL – rawr Mar 24 '16 at 17:32

4 Answers4

57

sas7bdat worked fine for all but one of the files I was looking at (specifically, this one); in reporting the error to the sas7bdat developer, Matthew Shotwell, he also pointed me in the direction of Hadley's haven package in R which also has a read_sas method.

This method is superior for two reasons:

1) It didn't have any trouble reading the above-linked file 2) It is much (I'm talking much) faster than read.sas7bdat. Here's a quick benchmark (on this file, which is smaller than the others) for evidence:

microbenchmark(times=10L,
               read.sas7bdat("psu97ai.sas7bdat"),
               read_sas("psu97ai.sas7bdat"))

Unit: milliseconds
                              expr        min         lq       mean     median         uq        max neval cld
 read.sas7bdat("psu97ai.sas7bdat") 66696.2955 67587.7061 71939.7025 68331.9600 77225.1979 82836.8152    10   b
      read_sas("psu97ai.sas7bdat")   397.9955   402.2627   410.4015   408.5038   418.1059   425.2762    10  a 

That's right--haven::read_sas takes (on average) 99.5% less time than sas7bdat::read.sas7bdat.

minor update

I previously wasn't able to figure out whether the two methods produced the same data (i.e., that both have equal levels of fidelity with respect to reading the data), but have finally done so:

# Keep as data.tables
sas7bdat <- setDT(read.sas7bdat("psu97ai.sas7bdat"))
haven <- setDT(read_sas("psu97ai.sas7bdat"))

# read.sas7bdat prefers strings as factors,
#   and as of now has no stringsAsFactors argument
#   with which to prevent this
idj_factor <- sapply(haven, is.factor)

# Reset all factor columns as characters
sas7bdat[ , (idj_factor) := lapply(.SD, as.character), .SDcols = idj_factor]

# Check equality of the tables
all.equal(sas7bdat, haven, check.attributes = FALSE)
# [1] TRUE

However, note that read.sas7bdat has kept a massive list of attributes for the file, presumably a holdover from SAS:

str(sas7bdat)
# ...
# - attr(*, "column.info")=List of 70
#   ..$ :List of 12
#   .. ..$ name  : chr "NCESSCH"
#   .. ..$ offset: int 200
#   .. ..$ length: int 12
#   .. ..$ type  : chr "character"
#   .. ..$ format: chr "$"
#   .. ..$ fhdr  : int 0
#   .. ..$ foff  : int 76
#   .. ..$ flen  : int 1
#   .. ..$ label : chr "UNIQUE SCHOOL ID (NCES ASSIGNED)"
#   .. ..$ lhdr  : int 0
#   .. ..$ loff  : int 44
#   .. ..$ llen  : int 32
# ...

So, if by any chance you need these attributes (I know some people are particularly keen on the labels, for instance), perhaps read.sas7bdat is the option for you after all.

MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
  • 1
    haven (v. 2.1.1) seems to retain "label" attribute. From `?haven::read_sas`: _"Variable labels are stored in the "label" attribute of each variable. It is not printed on the console, but the RStudio viewer will show it."_ – Chris Holbrook Jul 15 '19 at 17:59
  • If you want to access these attributes, just type `attributes(df)`. – Vincent Sep 08 '22 at 15:55
13

As of January 18, 2018, the haven R library will load sas and stata datasets into the R environment. In R, simply:

library(haven)
data <- read_sas("C:/temp/mysasdataset.sas7bdat")
View(data)

You can also load the data manually within R studio. In the environment pane, choose

Import Dataset > From SAS...

Select the file location and click "Import"

cacti5
  • 2,006
  • 2
  • 25
  • 33
  • could you clarify what's new about 2018? I cover read_sas extensively in the answer below. feel free to edit that response if you think it needs clarification as it's a community wiki. – MichaelChirico May 02 '18 at 00:12
7

Problem

The problem looks like the files you're trying to use are poorly formatted. Specifically, blank cells are not coded (R uses NA) but are simply left empty. When trying to load the tab-delimited file this creates problems for R which thinks there are incorrect numbers of columns.

Workaround using SAS files

I've found a workaround by loading the SAS file using the sas7bdat package and then recoding blank cells ("") as NA:

install.packages("sas7bdat")
require("sas7bdat")
download.file("http://nces.ed.gov/ccd/Data/zip/ag121a_supp_sas.zip",
              destfile = "sas.zip")
unzip("sas.zip")
sas <- read.sas7bdat(file = "ag121a_supp.sas7bdat", debug = FALSE)
sas[sas == ""] <- NA

There are two issues with this method to be aware of, though:

  1. It's slow (see comments)
  2. sas7bdat package is currently considered experimental at time of writing by its author. It therefore might not load all sas files, and I would check the ones it does thoroughly for inconsistencies before use.

Non-R solution

It's not exactly canonical, but you could also download the tab-delimited files, open them in LibreOffice Calc (Microsoft Excel seems to screw things up), and find and replace all by searching for "" and replacing with NA.

Jaap
  • 81,064
  • 34
  • 182
  • 193
Phil
  • 4,344
  • 2
  • 23
  • 33
  • This certainly works; did you find it _very_ slow? It took ~3 minutes to read in one file--basically unheard of on 18,000 rows from a `data.table` user! The good part is, I just need to load them once and can save them to `csv` from `R`. – MichaelChirico May 03 '15 at 15:39
  • I did find it slow, yes. Also, I would be careful to check the data frames thoroughly once input because `sas7bdat` is considered experimental by its author at time of writing this comment. But, if it works... – Phil May 03 '15 at 16:10
  • 2
    Just heard from Prof. Shotwell again, he pointed out that `haven` is basically a `C` port into an `R` package, while `sas7bdat` is written entirely in `R`, which explains the sluggishness. – MichaelChirico May 05 '15 at 17:35
  • As far as your non-R solution: very nice, and you can use `localc` at the command line, plus `sed`, to achieve the same outcome without a gui. – isomorphismes Sep 19 '17 at 13:14
0

Another option could be my readsas package. The syntax is similar to the one of the foreign package and to others of the read-series. The imported data is returned as a data.frame() with attributes. The package was written from scratch and features read support of uncompressed and compressed sas7bdat files.

Even though it has been tested a lot, the package is still in an early development stage.

library(readsas)
dat <- read.sas("psu97ai.sas7bdat")
Jan Marvin
  • 426
  • 1
  • 4
  • 5