I am trying to read data from the National Health Interview Survey into R: http://www.cdc.gov/nchs/nhis/nhis_2011_data_release.htm . The data set I need is Sample Adult. The SAScii library has a function, read.SAScii, whose documentation includes an example for exactly this data set. The issue is that the example "doesn't work":

NHIS.11.samadult.SAS.read.in.instructions <- 
  "ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Program_Code/NHIS/2011/SAMADULT.sas"
NHIS.11.samadult.file.location <- 
  "ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHIS/2011/samadult.zip"

#store the NHIS file as an R data frame!
NHIS.11.samadult.df <- 
  read.SAScii ( 
    NHIS.11.samadult.file.location , 
    NHIS.11.samadult.SAS.read.in.instructions , 
    zipped = TRUE )

#or store the NHIS SAS import instructions for use in a 
#read.fwf function call outside of the read.SAScii function
NHIS.11.samadult.sas <- parse.SAScii( NHIS.11.samadult.SAS.read.in.instructions )

#save the data frame now for instantaneous loading later
save( NHIS.11.samadult.df , file = "NHIS.11.samadult.data.rda" )

However, when I run it I get this error: Error in toupper(SASinput) : invalid multibyte string 533.

Others on Stack Overflow who hit a similar error with functions such as read.delim and read.csv were advised to set the argument fileEncoding="latin1", for example. The problem is that read.SAScii has no fileEncoding parameter.

See: "R: invalid multibyte string" and "Invalid multibyte string in read.csv"

gbrlrz017
  • You might try the *haven* package. `install.packages("haven")` – Rich Scriven Dec 16 '15 at 06:25
  • Try to download it manually, unzip it, and hand both the `.dat` file and the `.sas` file directly to `read.SAScii()`. Takes forever, but works on my machine. – Felix Dec 16 '15 at 09:56
  • My hunch is that the error is caused by changes in `download.file()` since `SAScii` was published. From the changelog of R 3.2.3: "(Windows only) The default method for accessing URLs _via_ download.file() and url() has been changed to be "wininet" using Windows API calls. This changes the way proxies need to be set and security settings made: there have been some reports of ftp: sites being inaccessible under the new default method (but the previous methods remain available)." You may want to file a bug with the author of `SAScii`. – Felix Dec 16 '15 at 10:10
  • The following link gives examples for all three: http://blog.datacamp.com/r-data-import-tutorial/ – Marc in the box Dec 16 '15 at 11:59
  • Hi Everyone, as Felix recommended, I filed a bug with the author of `SAScii` and found that the solution was simply to run `options( encoding = "windows-1252" )` before anything. I assume this is because I am using Linux/Unix. – gbrlrz017 Dec 29 '15 at 05:25
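
Felix's suggestion above (download manually, unzip, then hand the local files to `read.SAScii()`) could be sketched roughly as follows. This is only an illustration: `read_nhis_local` and its argument names are hypothetical, not part of SAScii, and the sketch assumes the archive contains a single data file.

```r
# Hypothetical helper (not part of SAScii): fetch the zip archive and the SAS
# import script into a working directory, unzip, and read the local copies.
read_nhis_local <- function( zip.url , sas.url , workdir = tempdir() ) {
  zip.path <- file.path( workdir , basename( zip.url ) )
  sas.path <- file.path( workdir , basename( sas.url ) )

  # mode = "wb" keeps the downloaded bytes unchanged on every platform
  download.file( zip.url , zip.path , mode = "wb" )
  download.file( sas.url , sas.path , mode = "wb" )

  # unzip() returns the paths of the extracted files;
  # assumption: the first (only) one is the fixed-width data file
  dat.path <- unzip( zip.path , exdir = workdir )

  SAScii::read.SAScii( dat.path[ 1 ] , sas.path )
}

# Usage, with the two URLs defined in the question (slow, needs network access):
# NHIS.11.samadult.df <-
#   read_nhis_local(
#     NHIS.11.samadult.file.location ,
#     NHIS.11.samadult.SAS.read.in.instructions )
```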

1 Answer

Just in case anyone has a similar problem: the fix for me was to run options( encoding = "windows-1252" ) right before running the read.SAScii code above. My understanding is that the ASCII file is meant for use in SAS, and therefore on Windows, whereas I am using Linux.
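
As a minimal sketch of that fix, the option can be set only for the duration of the import and restored afterwards. The helper name `with_encoding` is my own and not part of SAScii or base R; setting the option globally, as I originally did, works too.

```r
# Hypothetical helper: evaluate an expression with a temporary default
# connection encoding, then restore whatever was set before.
with_encoding <- function( enc , expr ) {
  old <- options( encoding = enc )      # options() returns the previous value
  on.exit( options( old ) , add = TRUE )
  force( expr )                         # expr is evaluated lazily, i.e. here
}

# Usage with the objects defined in the question (needs SAScii and network):
# NHIS.11.samadult.df <-
#   with_encoding( "windows-1252" ,
#     read.SAScii(
#       NHIS.11.samadult.file.location ,
#       NHIS.11.samadult.SAS.read.in.instructions ,
#       zipped = TRUE ) )
```

Restoring the option matters mostly in longer scripts, where a lingering global encoding can change how later files are read.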

The author of the SAScii library also maintains another GitHub repository, asdfree, with working code for downloading the CDC NHIS datasets for all available years, as well as many other datasets from various surveys such as the American Housing Survey, FDA Drug Surveys, and many more.

The following link points to the author's solution to the issue in this question. From there, you can easily find a link to the asdfree repository: https://github.com/ajdamico/SAScii/issues/3 .

As far as this dataset goes, the code in https://github.com/ajdamico/asdfree/blob/master/National%20Health%20Interview%20Survey/download%20all%20microdata.R#L8-L13 does the trick; however, it does not encode the columns as factors or numeric properly. The good news is that any given dataset in an NHIS year has only about ten to twenty (or fewer) numeric columns, so converting those to numeric one by one is not too painful, and the remaining columns can be converted in a single loop.
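
A rough sketch of that conversion on a toy data frame follows. The column names and values are made up for illustration, and it assumes the non-numeric columns should end up as factors.

```r
# Toy stand-in for an imported NHIS table; names and values are illustrative only.
nhis.toy <- data.frame(
  AGE_P   = c( "34" , "51" , "27" ) ,
  WTFA_SA = c( "1200.5" , "980.0" , "1433.2" ) ,
  SEX     = c( "1" , "2" , "2" ) ,
  stringsAsFactors = FALSE
)

# The short, hand-maintained list of columns that are truly numeric
numeric.cols <- c( "AGE_P" , "WTFA_SA" )

for ( col in numeric.cols ) {
  nhis.toy[[ col ]] <- as.numeric( nhis.toy[[ col ]] )
}

# Everything else is treated as categorical (assumption: factor is what you want)
for ( col in setdiff( names( nhis.toy ) , numeric.cols ) ) {
  nhis.toy[[ col ]] <- factor( nhis.toy[[ col ]] )
}

str( nhis.toy )
```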

The easiest solution for me, since I only need the Sample Adult dataset for 2011 and was able to get my hands on a machine with SAS installed, was to run the SAS program included at http://www.cdc.gov/nchs/nhis/nhis_2011_data_release.htm to encode the columns as necessary. Finally, I used proc export to export the SAS dataset to a CSV file, which I then opened in R with no edits to the data needed except for handling missing values.
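
For the missing values, something along these lines may help when reading the exported CSV. This is a sketch on a small temporary file with made-up values; whether your export writes missing values as blanks or as a lone period is an assumption you should check against your own file.

```r
# Write a tiny stand-in for the exported CSV; contents are illustrative only.
csv.path <- tempfile( fileext = ".csv" )
writeLines(
  c( "AGE_P,SEX,BMI" ,
     "34,1,27.4" ,
     "51,2,." ,      # a lone period standing in for a missing numeric value
     "27,,31.0" ) ,  # an empty field standing in for a missing value
  csv.path
)

# Treat empty fields and lone periods as NA while reading
nhis.csv <- read.csv( csv.path , na.strings = c( "" , "." , "NA" ) )

summary( nhis.csv )
```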

In case you want to work with NHIS datasets other than Sample Adult, it is worth noting that when I ran the available SAS program for the 2010 "Sample Adult Cancer" file (http://www.cdc.gov/nchs/nhis/nhis_2010_data_release.htm) and exported the data to a CSV, the file had fewer column names than actual columns, which caused a problem when I tried to read the CSV into R. Skipping the first line resolves this, but you lose the descriptive column names. You can, however, import this same data easily, without the encoding, using the R code in the asdfree repository. Please read the documentation there for more information.
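
The skip-the-first-line workaround can be sketched like this, using a small temporary file with made-up contents in place of the real export:

```r
# A header with two names followed by rows with four fields (illustrative only)
bad.csv <- tempfile( fileext = ".csv" )
writeLines(
  c( "HHX,FMX" ,
     "1,1,34,2" ,
     "2,1,51,1" ) ,
  bad.csv
)

# Skip the short header line and let R assign generic names V1, V2, ...
cancer.df <- read.csv( bad.csv , skip = 1 , header = FALSE )

names( cancer.df )
```

The generic names are the price of this workaround, which is why the asdfree code is the better route if you need the descriptive ones.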

gbrlrz017