Consider the following few lines from a Stata .dct file, which tells Stata how to read this fixed-width ASCII file (it can be decompressed with any ZIP software on any platform):
start type varname width description
_column(24) long rfv1 %5f Patient's Reason for Visit #1
_column(29) long rfv2 %5f Patient's Reason for Visit #2
_column(34) long rfv3 %5f Patient's Reason for Visit #3
_column(24) long rfv13d %4f Patient's Reason for Visit #1 - broad
_column(29) long rfv23d %4f Patient's Reason for Visit #2 - broad
_column(34) long rfv33d %4f Patient's Reason for Visit #3 - broad
Basically, characters 24 through 38 of every row in this ASCII file look like this:
AAAAaBBBBbCCCCc
where the first broad code is AAAA, the narrower code for that same reason is AAAAa, etc.
In other words, because the codes themselves have a hierarchical structure, the same characters in every row are read twice to create two different variables.
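To make the overlap concrete, here is a small illustration in base R on a single made-up row (the filler and the digits are invented, not taken from the real file). Each broad code is simply the first four characters of the corresponding five-character code:

```r
# A made-up 38-character row: 23 filler characters, then three 5-digit codes.
row <- paste0(strrep("x", 23), "123456789054321")

# Narrow codes: columns 24-28, 29-33 and 34-38 (width 5 each).
starts <- c(24, 29, 34)
narrow <- substring(row, starts, starts + 5 - 1)

# Broad codes: same starting columns, but only 4 characters wide.
broad <- substring(row, starts, starts + 4 - 1)
```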
read.fwf, by contrast, just takes a widths argument, which precludes this type of double-reading.
Is there a standard way of handling this, without reinventing the wheel by scanning in the entire file and parsing it by hand?
The background here is that I'm writing a function to parse these .DCT files, in the style of SAScii, and my job would be much simpler if I could specify (start, width) pairs for every variable rather than just widths.
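To make the interface I have in mind concrete, here is a naive sketch in base R. The function name read_start_width and the layout of its spec data frame are hypothetical, and it does exactly the read-everything-and-substring parsing I would rather not maintain myself, so treat it as an illustration of the desired (start, width) behaviour rather than a proposed solution:

```r
# Hypothetical helper: read (possibly overlapping) fixed-width fields
# given (varname, start, width) triples. Not part of any package.
read_start_width <- function(con, spec) {
  lines <- readLines(con)
  cols <- lapply(seq_len(nrow(spec)), function(i) {
    first <- spec$start[i]
    last <- first + spec$width[i] - 1
    # Fields may overlap freely because each is cut from the full line.
    as.numeric(trimws(substring(lines, first, last)))
  })
  names(cols) <- spec$varname
  as.data.frame(cols, stringsAsFactors = FALSE)
}

# The six variables from the dictionary excerpt above.
spec <- data.frame(
  varname = c("rfv1", "rfv2", "rfv3", "rfv13d", "rfv23d", "rfv33d"),
  start   = c(24, 29, 34, 24, 29, 34),
  width   = c(5, 5, 5, 4, 4, 4),
  stringsAsFactors = FALSE
)

# Two made-up rows standing in for the real file.
demo <- textConnection(c(
  paste0(strrep(" ", 23), "123456789054321"),
  paste0(strrep(" ", 23), "111112222233333")
))
result <- read_start_width(demo, spec)
close(demo)
```

Here result has one numeric column per dictionary entry, with rfv1 and rfv13d both cut from the characters starting at column 24.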