2

I'm trying to read quickly into R a ASCII fixed column width dataset, based on a SAS import file (the file that declares the column widths, and etc).

I know I can use SAScii R package for translating the SAS import file (parse.SAScii) and actually importing (read.SAScii). It works but it is too slow, because read.SAScii uses read.fwf to do the data import, which is slow. I would like to change that for a fast import mathod, laf_open_fwf from the "LaF" package.

I'm almost there, using parse.SAScii() and laf_open_fwf(), but I'm able to correctly connect the output of parse.SAScii() to the arguments of laf_open_fwf().

Here is the code, the data is from PNAD, national household survey, 2013:

# Set working dir.
  setwd("C:/User/Desktop/folder")

# installing packages: 
    install.packages("SAScii")
    install.packages("LaF")
    library(SAScii)
    library(LaF)

# Donwload and unzip data and documentation files
  # Data
    file_url <- "ftp://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_anual/microdados/2013/Dados.zip"
    download.file(file_url,"Dados.zip", mode="wb")
    unzip("Dados.zip")
  # Documentation files
    file_url <- "ftp://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_anual/microdados/2013/Dicionarios_e_input_20150814.zip"
    download.file(file_url,"Dicionarios_e_input.zip", mode="wb")
    unzip("Dicionarios_e_input.zip")

# importing with read.SAScii(), based on read.fwf(): Works fine
    dom.pnad2013.teste1 <- read.SAScii("Dados/DOM2013.txt","Dicionarios_e_input/input DOM2013.txt")

# importing with parse.SAScii() and laf_open_fwf() : stuck here
    dic_dom2013 <- parse.SAScii("Dicionarios_e_input/input DOM2013.txt")
    head(dic_dom2013)
    data <- laf_open_fwf("Dados/DOM2013.txt", 
                      column_types=?????  ,
                      column_widths=dic_dom2013[,"width"],
                      column_names=dic_dom2013[,"Varname"])

I'm stuck on the last commmand, passing the importing arguments to laf_open_fwf().

rafa.pereira
  • 13,251
  • 6
  • 71
  • 109
LucasMation
  • 2,408
  • 2
  • 22
  • 45
  • 1
    hi sorry i'm late to the party :) is there a reason that you shouldn't use the PNAD auto-download script hosted here? https://github.com/ajdamico/usgsd/tree/master/Pesquisa%20Nacional%20por%20Amostra%20de%20Domicilios it might not be blazing fast, but once you run the code you don't have to do it ever again... – Anthony Damico Dec 25 '14 at 23:36

3 Answers3

5

UPDATE: here are two solutions, using packages LaF and readr.

Solution using readr (8 seconds)

readr is based on LaF but surprisingly faster. More info on readr here

# Load Packages
  library(readr)
  library(data.table)

# Parse SAS file
  dic_pes2013 <- parse.SAScii("./Dicion rios e input/input PES2013.sas")

setDT(dic_pes2013) # convert to data.table

# read to data frame
  pesdata2 <- read_fwf("Dados/DOM2013.txt", 
                   fwf_widths(dput(dic_pes2013[,width]),
                              col_names=(dput(dic_pes2013[,varname]))),
                              progress = interactive()
                              )

Take way: readr seems to be the best option: it's faster, you don't need to worry about column types, shorter code and it shows a progress bar :)

Solution using LaF (20 seconds)

LaFis one of the (maybe THE) fastest ways to read fixed-width files in R, according to this benchmark. It tooke me 20 sec. to read the person level file (PES) into a data frame.

Here is the code:

# Parse SAS file
  dic_pes2013 <- parse.SAScii("./Dicion rios e input/input PES2013.sas")
    
# Read .txt file using LaF. This is virtually instantaneous
  pesdata <- laf_open_fwf("./Dados/PES2013.txt", 
                          column_types= rep("character", length(dic_pes2013[,"width"])), 
                          column_widths=dic_pes2013[,"width"],
                          column_names=dic_pes2013[,"varname"])

# convert to data frame. This tooke me 20 sec.
  system.time( pesdata <- pesdata[,] )

Note that that I've used character in column_types. I'm not quite sure why the command returns me an error if I try integer or numeric. This shouldn't be a problem, since you can convert all columns to numeric like this:

  # convert all columns to numeric
    varposition  <- grep("V", colnames(pesdata)) 
    pesdata[varposition] <- sapply(pesdata[],as.numeric)
    sapply(pesdata, class)
Community
  • 1
  • 1
rafa.pereira
  • 13,251
  • 6
  • 71
  • 109
4

You can try the read.SAScii.sqlite, also by Anthony Damico. It's 4x faster and lead to no RAM issues (as the author himself describes). But it imports data to a SQLite self-contained database file (no SQL server needed) -- not to a data.frame. Then you can open it in R by using a dbConnection. Here it goes the GitHub adress for the code:

https://github.com/ajdamico/usgsd/blob/master/SQLite/read.SAScii.sqlite.R

In the R console, you can just run:

    source("https://raw.githubusercontent.com/ajdamico/usgsd/master/SQLite/read.SAScii.sqlite.R")

It's arguments are almost the same as those for the regular read.SAScii.

I know you are asking for a tip on how to use LaF. But I thought this could also be useful to you.

RogerioJB
  • 355
  • 1
  • 8
1

I think that the best choice is to use fwf2csv() from desc package (C++ code). I will illustrate the procedure with PNAD 2013. Be aware that i'm considering that you already have the dictionary with 3 variables: beginning of the field, size of the field, variable name, AND the dara at Data/

library(bit64)
library(data.table)
library(descr)
library(reshape)
library(survey)
library(xlsx)

end_dom <- dic_dom2013$beggining + dicdom$size - 1

fwf2csv(fwffile='Dados/DOM2013.txt', csvfile='dadosdom.csv', names=dicdom$variable, begin=dicdom$beggining, end=end_dom)

dadosdom <- fread(input='dadosdom.csv', sep='auto', sep2='auto', integer64='double')
Flavio Barros
  • 996
  • 1
  • 11
  • 29
  • Flavio, tks. UAU! This is crazy fast! For the household level file (DOM) it is instantaneous. For the person level file (PES), which is larger it took 21s to convert to csv and 4s to read (fread). With read.fwf() (wrapped with read.SAScii() ) it takes 53min. Is this a safe method though? Regarding decimal points, variable types, etc? – LucasMation Sep 23 '14 at 20:18
  • @LucasMation, this method has two parts: the conversion from micro data to CSV AND read a CSV file. The first part, probably, is the main problem fixed by fwf2csv(), once this function is written in C++. On the second part you can use any method you choose, but i think that fread() is reasonably reliable. – Flavio Barros Sep 23 '14 at 20:48