I need to read a .dat file using a .dct file. Has anyone done that using R?

The format is:

dictionary {
  # how many lines per record
  _lines(1)
  # start defining the first line
  _line(1)

  # starting column / storage type / variable name / read format / variable label
  _column(1)    str8    aid    %8s    "respondent identifier"
  ...
}

'read formats' are like:

%2f        2 column integer variable
%12s      12 column string variable
%8.2f      8 column number with 2 implied decimal places. 

Storage types are described here: http://www.stata.com/help.cgi?datatypes

Other sites used for info:

http://library.columbia.edu/indiv/dssc/technology/stata_write.html

http://www.stata.com/support/faqs/data-management/reading-fixed-format-data/

The .dat file is a bunch of numbers corresponding to the variables specified in the .dct file. (Presumably this is data in fixed width columns).

Here is a real example:

.dct file http://goo.gl/qHZOk

data http://goo.gl/FRGRF

A specific example from the Stata site is:

The .dat file ("test.raw" in this instance)

C1245A101George Costanza
B1223B011Cosmo Kramer

The .dct file

dictionary using test2.raw {
 _column(1)     str5     code   %5s
 _column(2)     int      call   %4f
 _column(6)     str1     city   %1s
 _column(7)     int      neigh  %3f
 _column(10)    str16    name   %16s
}

The resulting data file:

      +-----------------------------------------------+
      |  code   call   city   neigh              name |
      |-----------------------------------------------|
   1. | C1245   1245      A     101   George Costanza |
   2. | B1223   1223      B      11      Cosmo Kramer |
      +-----------------------------------------------+
sdaza
  • Can you provide some documentation or reference about these files you're talking about? From some preliminary searching I'm guessing these are files from Stata? – Dason Jan 08 '13 at 21:45
  • What is a `.dct` file? What specific `.dat` filetype are you talking about? We are going to need more detailed information to answer you. – thelatemail Jan 08 '13 at 21:46
  • Give us some example files. Complete examples. And more info about where they come from. Otherwise the solution is just as likely to be found by a million monkeys with a million typewriters. – Spacedman Jan 08 '13 at 22:36
  • I wholeheartedly agree with @Spacedman. **IF** these files come from *Stata* (which is guesswork), perhaps the `memisc` package will be useful, as suggested in the help for `read.dta`, which you would have navigated towards after reading the wonderful [Data Import / Export Manual](http://cran.r-project.org/doc/manuals/r-release/R-data.html#Importing-from-other-statistical-systems) – mnel Jan 08 '13 at 22:40
  • @sdaza - I have edited your question to provide some *actual* useful information. This was from 5 minutes on Google - could you verify if this looks ok? – thelatemail Jan 09 '13 at 04:37
  • Thank you @thelatemail. My question is just whether there is a way to read those files using R. I have several big .dct and .dat files that I would like to read using R. Any ideas? – sdaza Jan 09 '13 at 05:36

2 Answers

@thelatemail is spot-on about how to proceed. Here's a small function I threw together to get you started on a more robust solution:

read.dat.dct <- function(dat, dct) {
    temp <- readLines(dct)
    pattern <- "_column\\(([0-9]+)\\)\\s+([a-z0-9]+)\\s+([a-z0-9_]+)\\s+%([0-9]+).*"
    classes <- c("numeric", "character", "character", "numeric")
    metadata <- setNames(lapply(1:4, function(x) {
        out <- gsub(pattern, paste("\\", x, sep = ""), temp)
        out <- gsub("^\\s+|\\s+$|.*\\{|\\}", "", out)
        out <- out[out != ""]
        class(out) <- classes[x] ; out }), 
                         c("StartPos", "Str", "ColName", "ColWidth"))
    read.fwf(dat, widths = metadata[["ColWidth"]], 
             col.names = metadata[["ColName"]])
}

There is still a lot you would have to do with respect to error checking, generalizing the function, and so on. For example, this function does not work with overlapping columns, such as those in the example that @thelatemail added to your question. A check that StartPos[n] + ColWidth[n] equals StartPos[n+1] could be used to stop with an error message, before reading the file, when the columns overlap or leave gaps. Additionally, the classes of the resulting data could be derived from the "Str" element of the "metadata" list generated by the function and passed to read.fwf through the colClasses argument.
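One possible way around the overlapping-column limitation is to slice each record by start position and width with substring() instead of relying on read.fwf. The following is only a sketch under that assumption: read.by.position is a hypothetical helper, not part of the function above, and the start positions and widths are typed in by hand from the Stata example in the question rather than parsed from a .dct file.

```r
# Hypothetical helper (not part of read.dat.dct): slices every record by
# start position and width, so fields are allowed to overlap.
read.by.position <- function(lines, start, width, col.names) {
    fields <- lapply(seq_along(start), function(i) {
        trimws(substring(lines, start[i], start[i] + width[i] - 1))
    })
    out <- as.data.frame(setNames(fields, col.names), stringsAsFactors = FALSE)
    # Convert columns that look numeric; everything else stays character
    out[] <- lapply(out, type.convert, as.is = TRUE)
    out
}

# Layout copied by hand from the question's example, where "code" (columns 1-5)
# overlaps "call" (columns 2-5)
start <- c(1, 2, 6, 7, 10)
width <- c(5, 4, 1, 3, 16)

# The contiguity check described above: FALSE here because of the overlap
contiguous <- all(head(start + width, -1) == tail(start, -1))

lines <- c("C1245A101George Costanza", "B1223B011Cosmo Kramer")
res <- read.by.position(lines, start, width,
                        col.names = c("code", "call", "city", "neigh", "name"))
res
```

In a real script, lines would come from readLines() on the .dat file, and start and width from the parsed dictionary. Reading everything as text first is slower than read.fwf on large files, so this is a trade-off rather than a drop-in replacement.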

Here is a dat file and a dct file to demonstrate:

Copy and paste the following two lines into a text editor and save it in your working directory as "test.dat".

C1245A101George Costanza
B1223B011Cosmo Kramer

Copy and paste the following lines into a text editor and save it in your working directory as "test.dct"

dictionary using test.dat {
    _column(1)     str1     code   %1s
    _column(2)     int      call   %4f
    _column(6)     str1     city   %1s
    _column(7)     int      neigh  %3f
    _column(10)    str16    name   %16s
}

Now, run the function:

read.dat.dct(dat = "test.dat", dct = "test.dct")
#   code call city neigh            name
# 1    C 1245    A   101 George Costanza
# 2    B 1223    B    11    Cosmo Kramer

Update: An improved function (with still a lot of room for improvement)

read.dat.dct <- function(dat, dct, labels.included = "no") {
    temp <- readLines(dct)
    temp <- temp[grepl("_column", temp)]
    switch(labels.included,
           yes = {
               pattern <- "_column\\(([0-9]+)\\)\\s+([a-z0-9]+)\\s+(.*)\\s+%([0-9]+)[a-z]\\s+(.*)"
               classes <- c("numeric", "character", "character", "numeric", "character")
               N <- 5
               NAMES <- c("StartPos", "Str", "ColName", "ColWidth", "ColLabel")
           },
           no = {
               pattern <- "_column\\(([0-9]+)\\)\\s+([a-z0-9]+)\\s+(.*)\\s+%([0-9]+).*"
               classes <- c("numeric", "character", "character", "numeric")
               N <- 4
               NAMES <- c("StartPos", "Str", "ColName", "ColWidth")
           })
    metadata <- setNames(lapply(1:N, function(x) {
        out <- gsub(pattern, paste("\\", x, sep = ""), temp)
        out <- gsub("^\\s+|\\s+$", "", out)
        out <- gsub('\"', "", out, fixed = TRUE)
        class(out) <- classes[x] ; out }), NAMES)

    metadata[["ColName"]] <- make.names(gsub("\\s", "", metadata[["ColName"]]))

    myDF <- read.fwf(dat, widths = metadata[["ColWidth"]], 
             col.names = metadata[["ColName"]])
    if (labels.included == "yes") {
        attr(myDF, "col.label") <- metadata[["ColLabel"]]
    }
    myDF
}

How does it work with your data?

temp <- read.dat.dct(dat = "http://dl.getdropbox.com/u/18116710/21600-0009-Data.txt", 
                     dct = "http://dl.getdropbox.com/u/18116710/21600-0009-Setup.dct",
                     labels.included = "yes")
dim(temp)                     # How big is the dataset?
# [1] 180  40
head(temp[, 1:6])             # What do the first few columns & rows look like?
#   CASEID      AID RRELNO RPREGNO H3PC1.H3PC1 H3PC2.H3PC2
# 1      1 57118381      5       1           1           1
# 2      2 57134970      1       2           1           1
# 3      3 57135078      1       1           1           1
# 4      4 57135078      5       1           1           1
# 5      5 57164981      1       1           7           3
# 6      6 57191909      1       3           1           1
head(attr(temp, "col.label")) # What are the variable labels?
# [1] "CASE IDENTIFICATION NUMBER"             "RESPONDENT IDENTIFIER"                 
# [3] "ROMANTIC RELATIONSHIP NUMBER"           "RELATIONSHIP PREGNANCY NUMBER"         
# [5] "S23Q1 1 TOLD PARTNER PREGNANT-W3"       "S23Q2 MONTHS PREG WHEN TOLD PARTNER-W3"

What about with the original example?

read.dat.dct("test.dat", "test.dct", labels.included = "no")
#   code call city neigh            name
# 1    C 1245    A   101 George Costanza
# 2    B 1223    B    11    Cosmo Kramer
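The labelled pattern in the improved function expects a letter immediately after the width digits, so a read format with decimals such as %12.7f (raised in the comments below) does not match it. A more tolerant parse might look like the following sketch. parse.read.format is a hypothetical helper, and it assumes formats of the simple form %<width>[.<decimals>]<letters>; anything else is rejected rather than guessed at.

```r
# Hypothetical helper: split a Stata read format into width, decimals and type.
# Assumes formats such as "%2f", "%12s" or "%8.2f"; other forms raise an error.
parse.read.format <- function(fmt) {
    pattern <- "^%([0-9]+)(\\.([0-9]+))?([a-z]+)$"
    if (!all(grepl(pattern, fmt))) {
        stop("unrecognised read format")
    }
    decimals <- sub(pattern, "\\3", fmt)
    decimals[!nzchar(decimals)] <- "0"   # no decimal part given
    data.frame(width    = as.integer(sub(pattern, "\\1", fmt)),
               decimals = as.integer(decimals),
               type     = sub(pattern, "\\4", fmt),
               stringsAsFactors = FALSE)
}

formats <- parse.read.format(c("%2f", "%12s", "%8.2f", "%12.7f"))
formats
```

The width column is what read.fwf needs; the decimals column could be used afterwards to rescale values stored with implied decimal places, which the functions above do not attempt.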
A5C1D2H2I1M1N2O1R2T1
  • @sdaza - not to be too snarky, but the people of stackoverflow are not your personal research staff. You asked a vague question, which I added to heavily to make it answerable. You have now been given a close-to-generalisable answer by Ananda. At some stage there is an expectation that you will build on the info provided rather than constantly moving the goal posts. – thelatemail Jan 10 '13 at 01:48
  • Thank you @thelatemail. I really appreciate all the great help you and others provide. It wasn't my intention to be snarky. Ananda's solution is simply great! I will try to go through and see if I can solve my problem. The intention of my question was just to know if someone had done something similar before, but apparently this issue is not that common. It doesn't seem straightforward to get these .dat and .dct files directly from R. My last solution will be to use first STATA and then import to R. Thank you everyone again! – sdaza Jan 10 '13 at 03:08
  • It looks great @Ananda Mahto. I still can't deal with my files. I added an example. Thanks! .dct file goo.gl/qHZOk data goo.gl/FRGRF – sdaza Jan 10 '13 at 03:11
  • Thanks @AnandaMahto. I am not familiar with regular expressions, so I haven't been able to deal with this data format in the 4th column of one of my dct files: %12.7f. I tried using \\s+%(.*) but I get this: Warning message: In class(out) <- classes[x] : NAs introduced by coercion, and only NAs in the dataset produced. Your expression for this was: \\s+%([0-9]+)[a-z]. I will try with this: http://stackoverflow.com/questions/5917082/regular-expression-to-match-numbers-with-or-without-commas-and-decimals-in-text – sdaza Jan 11 '13 at 16:13
  • It worked well. Here you have another example: dct http://goo.gl/jmj9V dat http://goo.gl/Ix4yu. Most of the data I am using is restricted so I will use your function and if I have further problems I will let you know. I will try to improve the varname.varname variable name pattern using your function, so that we only get "varname". Thank you! – sdaza Jan 11 '13 at 17:27

You may be able to read the dat files using ?read.fwf as the .dat data is essentially just a fixed width data file.

See here - Organizing Messy Notepad data - using the _column(X) values from the .dct dictionary file to work out the widths.

The dictionary file could be scraped using readLines to extract the info, which you could then pass to arguments in the read.fwf call.

E.g., the 'variable names' align with the col.names= argument, and the 'storage types' align with the colClasses= argument.

There would be some manual handling in this though.
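As an illustration of the storage-type idea, the mapping to colClasses might be sketched as follows. stata.to.r.class is a hypothetical helper, the widths and names are typed in by hand rather than scraped from a .dct file, and the example data are written to a temporary file only so that the snippet is self-contained.

```r
# Hypothetical mapping from Stata storage types to R classes for colClasses.
# Unknown types map to NA, which lets read.table guess the class itself.
stata.to.r.class <- function(storage) {
    ifelse(grepl("^str[0-9]*$", storage), "character",
    ifelse(storage %in% c("byte", "int", "long"), "integer",
    ifelse(storage %in% c("float", "double"), "numeric", NA_character_)))
}

# Example data written to a temporary file for illustration
dat <- tempfile(fileext = ".dat")
writeLines(c("C1245A101George Costanza", "B1223B011Cosmo Kramer"), dat)

# Widths, names and storage types entered by hand for a non-overlapping layout
df <- read.fwf(dat,
               widths = c(1, 4, 1, 3, 16),
               col.names = c("code", "call", "city", "neigh", "name"),
               colClasses = stata.to.r.class(c("str1", "int", "str1", "int", "str16")))
str(df)
```

This does not deal with implied decimal places in read formats such as %8.2f; those values would still need rescaling after import.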

thelatemail