6

I am trying to analyze data from the 2012-2013 NATS survey, from this location. There are three files in the zip folder there, labelled 2012-2013 NATS format.sas, formats.sas7bcat and nats2012.sas7bdat. The third file contains the actual data, but the second file contains the labels that go with the data; that is, as an example, if the variable 'Race' in the raw data file has categories 1,2,3 and 4, the labels show that these categories stand for 'Caucasian', 'African-American','Hispanic' and 'Other'. I have been able to import the sas7bdat file into R, using the 'sas7bdat' package, but when I try to do cross-tabulations, I am not able to see which category each cell represents. For example, if I try to do this:

table(SMOKSTATUS_R, RACEETHNIC)

What I get is:

RACEETHNIC
SMOKSTATUS_R     1     2     3     4     5     6     7     8     9
           1  4045   455    55     7    63     0   675   393   373
           2  1183   222    38     2    26     0   217   255   154
           3 14480   957   238    14    95     3  1112   950   369
           4 23923  2532  1157    23   147     1  1755  3223   909
           5    81    18     4     0     1     0    11    17     9

As far as I can tell, the only way to inlcude the labels to the data is manually typing them in, but there are 240 variables and besides, there are labels currently existing, in the form of the format.sas7bcat file. Is there any way to import the format file into R, so that the labels can be attached to the variables? This is how it is done in SAS, but I do not have access t oSAS right now. Thanks for all the help.

KVemuri
  • 194
  • 1
  • 16
  • 2
    Do a search on the `foreign` package -- here's the [reference manual](https://cran.r-project.org/web/packages/foreign/foreign.pdf) – JasonAizkalns Oct 29 '15 at 18:30
  • Just read the SAS program that has the code to define the formats and parse that. I am not familiar with NATS, but most people that publish SAS code to define formats produce the code in a very structured format that lends itself to being parsed easily. Or use SAS to export the format catalog to a SAS dataset that you can read from R. – Tom Oct 29 '15 at 18:43
  • `readsas7dbat` or `haven` are other packages; see [this](http://stackoverflow.com/questions/30006822/read-sas-sas7bdat-data-into-r/30043226#30043226) question and answer of mine. – MichaelChirico Oct 29 '15 at 18:44
  • I usually load into SAS then export as a single Stata file and load that into R with `foreign`. I find that works the best, since it preserves the format statements as metadata. – Carl Oct 29 '15 at 19:05
  • Isn't that too roundabout a way? Anyway, thanks all for your suggestions, but I am going to wait till next week to work in SAS directly. @MichaelChirico I can't create an Xport file from the SAS dataset either, which rules out using the package foreign along with the ones you mentioned. – KVemuri Oct 29 '15 at 20:17

2 Answers2

14

This should be a one-liner:

library('haven')
sas <- read_sas('nats2012.sas7bdat', 'formats.sas7bcat')

with(sas, table(SMOKSTATUS_R, RACEETHNIC))
#             RACEETHNIC
# SMOKSTATUS_R     1     2     3     4     5     6     7     8     9
#            1  4045   455    55     7    63     0   675   393   373
#            2  1183   222    38     2    26     0   217   255   154
#            3 14480   957   238    14    95     3  1112   950   369
#            4 23923  2532  1157    23   147     1  1755  3223   909
#            5    81    18     4     0     1     0    11    17     9

table(names(attr(sas[, 'SMOKSTATUS_R'], 'labels')[sas[, 'SMOKSTATUS_R']]),
      names(attr(sas[, 'RACEETHNIC'], 'labels')[sas[, 'RACEETHNIC']]))

#                          Amer. Indian, AK Nat. Only, Non-Hispanic
# Current everyday smoker                                        63
# Current some days smoker                                       26
# Former smoker                                                  95
# Never smoker                                                  147
# Unknown                                                         1

Use haven to read in the data, but that also gives you some useful attributes, namely the variable labels:

attributes(sas$SMOKSTATUS_R)
# $label
# [1] "SMOKER STATUS (4-level)"
# 
# $class
# [1] "labelled"
# 
# $labels
# Current everyday smoker Current some days smoker            Former smoker 
#                       1                        2                        3 
# Never smoker                  Unknown 
#            4                        5 
# 
# $is_na
# [1] FALSE FALSE FALSE FALSE FALSE

You can easily write this into a function to use more generally:

do_fmt <- function(x, fmt) {
  lbl <- if (!missing(fmt))
    unlist(unname(fmt)) else attr(x, 'labels')

  if (!is.null(lbl))
    tryCatch(names(lbl[match(unlist(x), lbl)]),
             error = function(e) {
               message(sprintf('formatting failed for %s', attr(x, 'label')),
                       domain = NA)
               x
             }) else x
}

table(do_fmt(sas[, 'SMOKSTATUS_R']),
      do_fmt(sas[, 'RACEETHNIC']))

#                          Amer. Indian, AK Nat. Only, Non-Hispanic
# Current everyday smoker                                        63
# Current some days smoker                                       26
# Former smoker                                                  95
# Never smoker                                                  147
# Unknown                                                         1

And apply to the entire data set

sas[] <- lapply(sas, do_fmt)
sas$SMOKSTATUS_R[1:4]
# [1] "Never smoker"  "Former smoker" "Former smoker" "Never smoker" 

Although sometimes this fails like below. This looks like something wrong with the haven package

attr(sas$SMOKTYPE, 'labels')
# INAPPLICABLE            REFUSED                 DK    NOT ASCERTAINED 
#     -4.00000           -0.62500           -0.50000           -0.46875 
# PREMADE CIGARETTES      ROLL-YOUR-OWN               BOTH 
#            1.00000            2.00000            3.00000 

So instead of this, you can parse the format.sas file with some simple regexes

locf <- function(x) {
  x <- data.frame(x, stringsAsFactors = FALSE)
  x[x == ''] <- NA
  indx <- !is.na(x)

  x[] <- lapply(seq_along(x), function(ii) {
    idx <- cumsum(indx[, ii])
    idx[idx == 0] <- NA
    x[, ii][indx[, ii]][idx]
  })
  x[, 1]
}

fmt <- readLines('~/desktop/2012-2013-NATS-Format/2012-2013-NATS-Format.sas')
## not sure if comments are allowed in the value definitions, but
## this will check for those in case
fmt <- gsub('\\*.*;|\\/\\*.*\\*\\/', '', fmt)

vars <- gsub('(?i)value\\W+(\\w*)|.', '\\1', fmt, perl = TRUE)
vars <- locf(vars)

regex <- '[\'\"].*[\'\"]|[\\w\\d-]+'
vals <- gsub(sprintf('(?i)\\s*(%s)\\s*(=)\\s*(%s)|.', regex, regex),
               '\\1\\2\\3', fmt, perl = TRUE)

View(dd <- na.omit(data.frame(values = vars, formats = vals,
                              stringsAsFactors = FALSE)))

sp <- split(dd$formats, dd$values)
sp <- lapply(sp, function(x) {
  x <- Filter(nzchar, x)
  x <- strsplit(x, '=')
  tw <- function(x) gsub('^\\s+|\\s+$', '', x)
  sapply(x, function(y)
    setNames(tw(y[1]), tw(y[2])))
})

So the smoke type formats (one of them that failed above), for example, gets parsed like this:

sp['A5_']
# $A5_
# 'INAPPLICABLE'            'REFUSED'                 'DK' 
#           "-1"                 "-7"                 "-8" 
# 'NOT ASCERTAINED' 'PREMADE CIGARETTES'      'ROLL-YOUR-OWN'  'BOTH' 
#              "-9"                  "1"                  "2"     "3" 

And then you can use the function again to apply to the data

table(do_fmt(sas['SMOKTYPE'], sp['A5_']))

# 'BOTH'                 'DK'       'INAPPLICABLE' 
#   736                   17                51857 
# 'PREMADE CIGARETTES'            'REFUSED'      'ROLL-YOUR-OWN' 
#                 7184                    2                  396 
rawr
  • 20,481
  • 4
  • 44
  • 78
3

The formats.sas file should be readable and parseble into column label vectors, which you then apply as you would any column label vector.

If you're looking to label the categorical variables, which is presumably what you're mostly concerned about based on your question, this should be fairly straightforward. You'll see code that looks like this:

value RACEF
1 = 'Caucasian'
2 = 'African-American'
3 = 'Hispanic'
4 = 'Other'
;

You just need to parse that into a vector.

If you're lucky, their category format names will be identical to the column names (maybe with an F like I have in that example); if that's the case you can probably just work out how to apply them directly.

If it's not, you'll have to parse the second half of the program. It will consist of lines like this:

format
  race RACEF.
  gender SEXF.
  income INCRF.
...
;

That of course shows the relationship between column name and format name, and thus tells you which vector of column names you should use to label which column.

Joe
  • 62,789
  • 6
  • 49
  • 67
  • An alternate method would be to get someone with SAS to export the formats for you in a CSV or sas dataset, which is easy to do; you could even download SAS University Edition and do it yourself there. If that's a more feasible option, I can add another answer showing how to do that. – Joe Oct 29 '15 at 19:03
  • Thanks for your answer. I tried reading the `formats.sas` using the packages `haven`, `sas7bdat` and `SAScii`. I get an error message saying `Invalid file, or file has unsupported features.`. And I do not have XPORT either, in order to try `foreign`. So I am stuck until I find someone to convert from SAS to CSV, or another format. – KVemuri Oct 29 '15 at 19:24
  • `formats.sas` should be a plain old regular text file. Just read it how you normally read plain old regular text files... Or even just open it up in [text editor of your choice] and manually convert it to a CSV - that's probably not very hard, a few minutes' work if it's formatted/indented decently. – Joe Oct 29 '15 at 19:24
  • Thanks again, I was able to open the file in R, although it was not correctly indented, like you mentioned. the output looks like it is not formatted in the way I want it. Anyway, I am going to accept your answer, since it worked, but I am not willing to spend anymore time on this. I will wait till I have SAS, which is next week! Thanks once more. – KVemuri Oct 29 '15 at 20:12