Read multiples of same columns in a fixed width file

Question

Consider the following few lines from a Stata .dct file which defines for Stata how to read this fixed width ASCII file (can be decompressed with any ZIP software on any platform):

start             type                            varname width  description
_column(24)       long                               rfv1   %5f  Patient's Reason for Visit #1            
_column(29)       long                               rfv2   %5f  Patient's Reason for Visit #2             
_column(34)       long                               rfv3   %5f  Patient's Reason for Visit #3             
_column(24)       long                             rfv13d   %4f  Patient's Reason for Visit #1 - broad     
_column(29)       long                             rfv23d   %4f  Patient's Reason for Visit #2 - broad     
_column(34)       long                             rfv33d   %4f  Patient's Reason for Visit #3 - broad

Basically the 24th through 39th characters in every row of this ASCII file look like this:

AAAAaBBBBbCCCCc

Where the first broad code is AAAA, the narrower code for that same reason is AAAAa, etc.

In other words, because the codes themselves have a heirarchical structure, the same characters in every row are read twice to create two different variables.

read.fwf, by contrast, just takes a widths argument, which precludes this type of double-reading.

Is there a standard way of handling this, without recreating the wheel from scratch by scanning in the entire file and parsing it by hand?

The background here is that I'm writing a function to parse these .DCT files, in the style of SAScii, and my job would be much simpler if I could specify (start, width) pairs for every variable rather than just widths.

Does SAScii handle fixed width files with "overlapping" columns? — A5C1D2H2I1M1N2O1R2T1, Apr 23 '13 at 16:36
@AnandaMahto From `?read.SAScii`: "This function cannot handle overlapping columns." — Ari B. Friedman, Apr 23 '13 at 20:16

A5C1D2H2I1M1N2O1R2T1 · Answer 1 · 2013-04-24T05:22:01.323

I had started working on a .DCT parser but lost steam. My intended usage scenario was to actually simply parse the file and create a csvkit schema file to allow me to use csvkit to convert the file from fixed width to csv. To that end, the package was successful, but it is very un-refined and only very minimally tested.

A couple of the problems to look out for include (1) not all DCT files have the same columns; (2) some DCT files have instructions for implicit decimal places, and I never got around to coming up with a method for dealing with those types of files.

You can find the initial work on the package here.

The main functions are:

dct.parser -- Does what you would expect. There is a "preview" argument that reads in the first few lines to let you determine whether the DCT file has all the columns you expect or not.
csvkit.schema -- Using the info extracted from dct.parser, a csv file is created with the relevant columns required by in2csv from csvkit.
csvkit.fwf2csv -- Basically a system call to csvkit. Can also be done outside of R.

For your particular example, I successfully read it using:

## The extracted data file and the DCT file are in my downloads directory
setwd("~/Downloads/") 
dct.parser("ed02.dct", preview=TRUE) ## It seems that everything is there
temp <- dct.parser("ed02.dct")       ## Can be used as a lookup table later

## The next line automatically creates a csv schema file in your 
##   working directory named, by default, "your-dct-filename.csv"
csvkit.schema(temp) 
csvkit.fwf2csv(datafile = "ED02", schema="ed02.dct.csv", output="ED02.csv")

## I haven't set up any mechanism to check on progress...
## Just check the directory and see when the file stops growing :)
ED02 <- read.csv("ED02.csv")

Another alternative that I had intended to work on (but never did) was to use paste to construct substr commands that could be used by sqldf to read in data where the columns contain overlapping data. See this blog post for an example to get started.

Update: An `sqldf` example

As mentioned above, sqldf can make good use of the output of dct.parser and read in your data by using substr. Here's an example of how you would do this:

## The extracted data file and the DCT file are in my downloads directory
setwd("~/Downloads/") 
temp <- dct.parser("ed02.dct")       ## Can be used as a lookup table later

## Construct your "substr" command
GetMe <- paste("select", 
               paste("substr(V1, ", temp$StartPos, ", ",
                     temp$ColWidth, ") `", temp$ColName, "`", 
                     sep = "", collapse = ", "), 
               "from fixed", sep = " ")

## Load "sqldf"
library(sqldf)

fixed <- file("ED02")
ED02 <- sqldf(GetMe, file.format = list(sep = "_"))
dim(ED02)
# [1] 37337   260

As can be seen, a little modification was necessary in the sqldf line. In particular, since sqldf uses read.csv.sql, it saw any comma characters in your data as a delimiter. You can just change that to something you don't expect in the data.

By the way, I had chosen to focus on "csvkit" because it's pretty fast at performing the conversion, and you end up with a convenient standard format csv file that can be shared with others (who might not be using R) (shame on them!) — A5C1D2H2I1M1N2O1R2T1, Apr 23 '13 at 16:38
This is awesome. Thanks. I'd clobbered something together using `scan` and `substr` but this looks much more fully fledged. — Ari B. Friedman, Apr 23 '13 at 17:45
I suspect that the `sqldf` approach can handle overlapping reads. — IRTFM, Apr 23 '13 at 17:49
@DWin, I had already alluded to that in my original answer. Have edited to show how it can be done. — A5C1D2H2I1M1N2O1R2T1, Apr 24 '13 at 05:22
@AriB.Friedman, `substr` is not an unreasonable approach. I've shown how to use it with `sqldf`, and that's reasonably fast. — A5C1D2H2I1M1N2O1R2T1, Apr 24 '13 at 05:28

score 1 · Answer 2 · answered Aug 22 '13 at 08:02

This has only just been tagged with Stata (thanks to @Metrics), so may well have not been noticed by many Stata enthusiasts.

From a pure Stata point of view it seems simple enough to read in each 5 digit long variable and then extract the first 4 digits by e.g.

. gen rvf13d = floor(rvf13/10)

or to read in those codes as string and then

. gen rvf13d = substr(rvf13, 1, 4)

So, you should never need to read in the same data twice.

That said, this seems oblique to the question, in which dictionary files are given and you wouldn't want to hand-edit each of several.

Read multiples of same columns in a fixed width file

2 Answers2

Update: An `sqldf` example

Linked

Read multiples of same columns in a fixed width file

2 Answers2

Update: An sqldf example

Linked

Update: An `sqldf` example