Efficiently Import SAS into R looping with read_sas, columns to import differ by file

Question

Trying to use read_sas to import some SAS files in one folder into R, with the file name meeting the pattern "medpar20XX", (XX = 00,01,....,16). Also, a specific list of columns is needed. I used a for loop but only the last SAS data file (according to the loop increment indicator) was imported successfully, and all the resulted R data frames are this last data file.

Below is code to get the list of files in the folder that meet the file name pattern

patt = "medpar[0-9]{4}[[:punct:]]sas7bdat"
file_list <- list.files(path="E:/Data/Bell_Disasters",pattern = patt)

The file_list is here

The code to read a single file with path and file name spelled out:

medpar2000 <- read_sas("E:/Data/Bell_Disasters/medpar2000.sas7bdat", cols_only = c("HIC","PRVNUMGRP","SSLSSNF","sadmsndt","sdschrgdt"))

The SAS file was imported successfully.

Below is the for loop to read in the SAS data files. For year between 2000 -2002 the columns needed are the same (specified in cols_only=c("HIC","PRVNUMGRP","SSLSSNF","sadmsndt","sdschrgdt")) , but for other years, the columns are different. When year is between 2003 and 2006, cols_only = c('BENE_ID','PRVSTATE','PRVNUM3','PRVDRSRL','SSLSSNF','ADMSNDT','DSCHRGDT'). For year between 2007 and 2012, cols_only = c('bene_id', 'MEDPAR_ID', 'PRVDR_NUM', 'SS_LS_SNF_IND_CD', 'ADMSN_DT', 'DSCHRG_DT').

Below is the code I tested on year between 2000 and 2002 only, and I update the logic on year to if', instead offor` loop:

for (i in 1:length(file_list))
{
  # retrieve the year number in the sas file name
  year <-regmatches(file_list[i],regexpr('[0-9]{4}',file_list[i]))
  if (year %in% c('2000','2001','2002')) {
    # read in SAS data set
    temp_data <- read_sas(file.path('E:/Data/Bell_Disasters',file_list[i]), cols_only = c("HIC","PRVNUMGRP","SSLSSNF","sadmsndt","sdschrgdt"))
    # rename data set
    assign(paste('medpar',year,sep =''),temp_data)
  }
 else if (year %in% c('2003','2004','2005','2006') {

  # read in SAS data set
    temp_data <- read_sas(file.path('E:/Data/Bell_Disasters',file_list[i]), cols_only=c('BENE_ID','PRVSTATE','PRVNUM3','PRVDRSRL','SSLSSNF','ADMSNDT','DSCHRGDT')
    # rename data set
    assign(paste('medpar',year,sep =''),temp_data)
}

}

The process is extremely slow. When I force stop, I see some files are indeed imported successfully into R. Is there a way to make this process more efficient?

Per `r` tag (hover to see): *please specify all non-base packages with `library()` calls* such as `read_sas`. — Parfait, Aug 16 '19 at 19:16
By the way, re-consider using `assign` to save potentially many similar structured data frames to flood global environment. Instead use **one** list of many data frames. As @Gregor advises [Don't ever create d1 d2 d3, ..., dn in the first place. Create a list d with n elements.](https://stackoverflow.com/a/24376207/1422451) — Parfait, Aug 16 '19 at 19:17
Why are you looping over file names and also looping over years? Isn't the year part of the filename? — Tom, Aug 17 '19 at 18:50
@Tom Maybe looping over years is a bad idea, but I need a check on the year value because the `cols_only` depend on the year. — user159193, Aug 19 '19 at 14:02

score 0 · Answer 1 · answered Aug 16 '19 at 19:59

Consider lapply to build a list of named data frames. Also, wrap read_sas in tryCatch so on error it returns an empty data frame of similar columns for those problematic one or more files. From final output, investigate the zero-row data frames to find out why their imports failed.

setwd("E:/Data/Bell_Disasters")
file_list <- list.files(pattern = "medpar[0-9]{4}[[:punct:]]sas7bdat")

# BUILD LIST OF DATA FRAMES
medpar_list <- lapply(file_list, function(f) {    
   tryCatch(read_sas(f, cols_only = c("HIC", "PRVNUMGRP", "SSLSSNF", 
                                      "sadmsndt", "sdschrgdt")),
            error = function(e) data.frame(HIC=NA, PRVNUMGRP=NA, SSLSSNF=NA, 
                                           sadmsndt=NA, sdschrgdt=NA)
           )                
})

# NAME LIST OF DATA FRAMES
names(medpar_list) <- gsub(".sas7bdat", "", file_list)

# DISPLAY DATA FRAMES
head(medpar_list$medpar2000)
head(medpar_list$medpar2001)
head(medpar_list$medpar2002)
...

How do I incorporate the requirement that columns to import differ by year. i.e. for year between 2000 and 2002, `cols_only = c("HIC","PRVNUMGRP","SSLSSNF","sadmsndt","sdschrgdt")`, but when year is between 2003 and 2006, `cols_only = c('BENE_ID','PRVSTATE','PRVNUM3','PRVDRSRL','SSLSSNF','ADMSNDT','DSCHRGDT')` — user159193, Aug 20 '19 at 14:33

score 0 · Answer 2 · answered Aug 16 '19 at 20:35

You have a logic error. For each file you are looping though 2000, 2001 and 2002 and therefore at the end, all three data frames medpar2000 to medpar2002 will contain the same data corresponding to the last SAS dataset that is being read. The following code works fine as you expect to.

library(haven)

patt = "medpar[0-9]{4}[[:punct:]]sas7bdat"
file_list <- list.files(path="E:/Data/Bell_Disasters",pattern = patt)

for (i in 1:length(file_list))
{
  # retrieve the year number in the sas file name
  year <-regmatches(file_list[i],regexpr('[0-9]{4}',file_list[i]))

  # read in SAS data set
  temp_data <- read_sas(file.path('E:/Data/Bell_Disasters',file_list[i]), cols_only = c("HIC","PRVNUMGRP","SSLSSNF","sadmsndt","sdschrgdt"))
  # rename data set
  assign(paste('medpar',year,sep =''),temp_data)
}

Efficiently Import SAS into R looping with read_sas, columns to import differ by file

2 Answers2