
I have some microdata files from a population census stored as .txt and encoded in ASCII. When I open them in a text editor I get something like: 1100015110001500100100003624008705865085282310200600101011022022 14 444231 etc.

Since I have no experience with tabulating ASCII data, I would like to know whether there is a way to get this done with R and/or what kind of supplementary software I need.

At first I would simply like to have a "normal" look at my data, that is, to view it as a table if possible (the file sizes vary between 40 MB and 500 MB). Then I would like to make some simple calculations and later store the results as a .csv to use in other contexts.

Can anyone give me some advice?

Joschi
  • You provide insufficient context. In general R is able to handle such text data just fine, and ASCII is a supported encoding. You can specify a `fileEncoding` for functions such as `read.table` but you don’t need to in the first place if your data is only numeric. – Konrad Rudolph Dec 20 '12 at 12:18
  • point us to the data files you are looking at! :) – Anthony Damico Dec 20 '12 at 12:48
  • the main problem is that the data appears in ASCII code. I don't know how to convert it to characters or how to use it in this form in R. here is an example of the data: ftp://ftp.ibge.gov.br/Censos/Censo_Demografico_2010/Resultados_Gerais_da_Amostra/Microdados/AP.zip – Joschi Dec 20 '12 at 12:53
  • @Joschi where is the page that you got this link from? are there SAS import instructions anywhere? – Anthony Damico Dec 20 '12 at 12:54
  • If you mean that your source file is supposed to contain the 2- or 3-digit ASCII codes for the actual data, then you have to find out what the format (delimiters, e.g.) is of the source file. Neither R nor any other language can automagically do that for you. – Carl Witthoft Dec 20 '12 at 13:08
  • if you have access to a linux box (or cygwin on windows) call `head -n 5 file | od -c`; this will give you the first five lines of the file character by character. if you can find the field separator (most likely `\s` or `\t`) then you can pass this as an argument to `read.table` in R and get the data in. – richiemorrisroe Dec 20 '12 at 13:22
  • another option for the brazilian censo demografico: http://www.asdfree.com/search/label/censo%20demografico%20no%20brasil%20%28censo%29 – Anthony Damico May 30 '15 at 16:43
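The `od` inspection suggested in the comments can be tried on a toy file first; the sample contents below are invented stand-ins for a census record:

```shell
# Create a tiny stand-in for a fixed-width census extract
printf '1100015110001500100\n1100015110001500200\n' > sample.txt

# Dump the first lines character by character; any delimiter
# (tab, space, comma) would show up between the digits
head -n 5 sample.txt | od -c
```

If nothing but digits and `\n` shows up, the file is fixed-width rather than delimited, so `read.fwf` (not `read.table`) is the right tool.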

2 Answers


this brazilian census website provides a SAS importation script. the quickest way to import an ASCII data set with only a SAS importation script is to use the SAScii package. you can find the SAS importation script inside this zipped file -- it's INPUT.txt. notice that the INPUT block of those SAS importation instructions doesn't start until the fourth line, so your beginline parameter will be 4. first test that you're reading the SAS script correctly with ?parse.SAScii

library(SAScii)
parse.SAScii( "INPUT.txt" , beginline = 4 )

once you see that it prints the column names and widths correctly, you can use the ?read.SAScii function to read your text file directly into an R data frame

x <- read.SAScii( "filename.txt" , "INPUT.txt" , beginline = 4 )
head( x )

if your file is too big to read entirely into RAM, you can instead read it into a SQLite database. use the read.SAScii.sqlite() function, found not in the SAScii package but in my github account here -- it's just a slight variation of read.SAScii(), but it doesn't overload RAM. you can see an example of its usage in the download script on this united states government survey data set website.
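as a rough illustration of the chunked idea behind that function (this is not the actual read.SAScii.sqlite() code -- the chunk size, file names, and the `widths`/`varnames` objects are assumptions, taken here from a prior parse.SAScii() call):

```r
# Sketch: read a fixed-width file in chunks and append each chunk to
# SQLite, so the whole file never has to fit in RAM at once.
# Assumes `widths` and `varnames` came from
# parse.SAScii( "INPUT.txt" , beginline = 4 ).
library(DBI)
library(RSQLite)

con <- dbConnect( SQLite() , "census.db" )

chunk <- 10000   # rows per pass -- tune to your available memory
skip <- 0
repeat {
    x <- read.fwf( "filename.txt" , widths = widths ,
                   col.names = varnames , skip = skip , n = chunk )
    if ( nrow( x ) == 0 ) break
    dbWriteTable( con , "census" , x , append = TRUE )
    skip <- skip + chunk
}

dbDisconnect( con )
```

once it's in SQLite you can run your simple calculations with SQL queries instead of holding the full table in memory.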

for more detail about the SAScii package, check out this overview

Anthony Damico
  • in case you can't find the SAS importation script, even if there's a layout file, you could properly construct a `read.fwf` call as Romain stated above by importing the excel layout. it has where fields begin and end, so you just have to take the starting and ending positions and use end - start + 1 as the `width` parameter :) good luck! – Anthony Damico Dec 20 '12 at 14:29
  • all right. Until now it works just fine with the SAS importation script ;) – Joschi Dec 20 '12 at 14:57
  • ok just finished reading the smallest file with 78,344 observations of 187 variables. it took about 4:30 minutes. so maybe it is really a good idea to work with a database! – Joschi Dec 20 '12 at 15:20
  • @Joschi the SQLite route will actually be slower - try the largest file instead and see if it overloads your RAM :) one other option might be to use `parse.SAScii` just to determine the field widths, then use `fwf2csv` in the `descr` package.. at which point you can `read.csv` or `read.csv.sql` and still not overload RAM. [there's an example of this being done in the middle of this function](https://github.com/ajdamico/usgsd/blob/master/MonetDB/read.SAScii.monetdb.R) – Anthony Damico Dec 20 '12 at 15:33
  • @Joschi idk if you can tell, i do this a lot ;) – Anthony Damico Dec 20 '12 at 15:34
  • ok i tried and it didn't crash. But it took a while ;) Thanks for indicating and developing this package, i'll surely use it a lot for future analyses of microdata!!! – Joschi Dec 20 '12 at 17:11
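the `read.fwf` fallback mentioned in the comments can be sketched like this (the positions and column names here are made up for illustration -- take yours from the layout file):

```r
# Hypothetical layout: field start and end positions copied from the
# Excel layout file (1-based, inclusive)
starts <- c( 1 , 3 , 8 )
ends   <- c( 2 , 7 , 12 )

# width of each field is end - start + 1, not just the difference
widths <- ends - starts + 1

x <- read.fwf( "filename.txt" , widths = widths ,
               col.names = c( "field1" , "field2" , "field3" ) )
```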

A good alternative is the readr package, an extremely fast solution for reading fixed-width data. More info on readr here.

So instead of read.SAScii, you can use a faster option based on readr, like this:

# Load packages
  library(readr)
  library(SAScii)
  library(data.table)


# Parse the SAS input file
  dic_pes2013 <- parse.SAScii("INPUT.txt")

  setDT(dic_pes2013) # convert to data.table

# Read the fixed-width file into a data frame
  pesdata2 <- read_fwf("./Dados/PES2013.txt",
                       fwf_widths(dic_pes2013[, width],
                                  col_names = dic_pes2013[, varname]),
                       progress = interactive())

I've just read 2.4 million records with 243 variables in 1.2 minutes (file Amostra_Pessoas_35_outras.txt).

ps. if you don't have the INPUT.txt files, here is a short script on how to create them.

Note that some variables have implied decimals, something that is not handled in the solutions provided by the other answers posted here (at least so far). To take this into account, I would recommend this R script, which will help you download the 2010 Brazilian Census data sets, read them into data frames and save them as .csv files.
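If you go the read_fwf route, the implied decimals can be applied afterwards. A sketch, assuming parse.SAScii returned its usual `char` and `divisor` columns and that `pesdata2` holds the raw undivided values from the read_fwf call above:

```r
# parse.SAScii reports a `divisor` for fields with implied decimals
# (e.g. 0.01 for two implied decimal places); multiply the raw
# numeric values by it. `char` flags character fields to skip.
for ( i in seq_len( nrow( dic_pes2013 ) ) ) {
    v <- dic_pes2013$varname[ i ]
    d <- dic_pes2013$divisor[ i ]
    if ( !dic_pes2013$char[ i ] && d != 1 )
        pesdata2[[ v ]] <- pesdata2[[ v ]] * d
}
```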

rafa.pereira