
I am trying to read some data from the Roper Center into R for analysis. The older data sometimes comes only in ASCII format: just a data file of numbers, sometimes with no spaces or delimiters. Also, every person has several rows. Here is an example:

0001   01 06722121     101632          3113581R50106                050110M323
0001   0202089917300208991744  100154109020B73013.22        1O                
0001   039049MON FEB  8 1999 05:30pm   1 8   0208991830 6:30PM         05071  
0001   04                5                                       51           
0001   052206  32     1    21                     111                         
0001   06        1122223413323                      1122160921080711122112  11
0001   0722221205111223241121212220612111111122 21 2222                     
0002   01 09318035     001582          2123551R00106                0501I333
0002   0202089917320208991746   50074616080B42014.20        1O              
0002   039039MON FEB  8 1999 05:31pm   1 8   0208991831 6:31PM         05041  
0002   04                2                                       61         
0002   05 206  32     3    11                     121                         
0002   06        1245545554555                      1152080614031221121131  11
0002   0752321202112112322112434410722131242122 21 122222

I changed some of the numbers in there. Hopefully I didn't mess it up, but I think you need a subscription to the Roper Center to get this data.

I need to extract several elements for each respondent and put them into columns. I'll be doing this many times, so code that only works for this case is not practical.

I have been using the readr package in R so far, but now that there are many rows per person it's becoming more complicated, and I wondered if anyone knew of a fast way to handle this with an R package or a simple function.

A good example would be to get all of the weights in this sample. They occur in columns 13-15 of the first row for each person.

debo
  • Please provide the desired output given the input example (or even a smaller sample of that), because to me it's not completely clear what you are trying to accomplish – digEmAll Jan 05 '17 at 20:35
  • I just added an edit that should make it more clear. – debo Jan 05 '17 at 20:51
  • Looks like a duplicate of: http://stackoverflow.com/questions/15596679/read-observations-in-fixed-width-files-spanning-multiple-lines-in-r. R works best when your input data is tidy (clean and rectangular). If you have odd formats like this, it may be best to pre-process them elsewhere. Programs like SAS do a better job reading crazy formats like this. – MrFlick Jan 05 '17 at 21:35
  • Were 4 down votes really necessary? This isn't an exact duplicate even though it's similar. The method in the linked question would not work for my example because that data is separated by spaces and mine is not. Something else would have to be used. – debo Jan 06 '17 at 00:29

1 Answer


Cool solution: your files come with a dictionary of fixed widths, right? In that case, use readr::read_fwf, which reads columns by character position and does not need delimiters.
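As a rough sketch of what that could look like for the weights mentioned in the question: the snippet below writes a few of the sample lines to a temporary file and reads three fields by position. Only the weight position (columns 13-15) comes from the question; the respondent id in columns 1-4 and the card number in columns 8-9 are assumptions read off the sample, and the names uid, card and weight are made up for illustration.

```r
library(readr)

# Two respondents, two cards each, copied from the sample in the question.
sample_lines <- c(
  "0001   01 06722121     101632          3113581R50106                050110M323",
  "0001   0202089917300208991744  100154109020B73013.22        1O                ",
  "0002   01 09318035     001582          2123551R00106                0501I333",
  "0002   0202089917320208991746   50074616080B42014.20        1O              "
)
path <- tempfile(fileext = ".txt")
write_lines(sample_lines, path)

# Positions are 1-based and inclusive. Only `weight` (13-15) is given in the
# question; the `uid` (1-4) and `card` (8-9) positions are assumptions.
cards <- read_fwf(
  path,
  col_positions = fwf_positions(
    start = c(1, 8, 13),
    end   = c(4, 9, 15),
    col_names = c("uid", "card", "weight")
  ),
  col_types = "ccc" # read everything as character; convert later if needed
)

# The weight lives on the first card of each respondent, so keep only card 01.
weights <- cards[cards$card == "01", c("uid", "weight")]
```

Reading everything as character keeps leading zeros intact; as.numeric(weights$weight) can be applied afterwards once the codebook confirms how the field is scaled.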

Ugly solution below. It will probably choke if you have a lot of data, and it might (no, will) fail to separate variables that sit next to each other with no whitespace between them.

x is the path to your ASCII file.

library(dplyr)
library(readr)
library(stringr) # provides str_sub()

x <- read_lines(x)
x <- tibble(
  uid = str_sub(x, 1, 4), # careful here, assuming UIDs are 4-length
  txt = str_sub(x, 8)     # careful here too
)

# for each respondent, glue all rows together, then split on whitespace
lapply(unique(x$uid), function(y) {
  paste0(x$txt[ x$uid == y], collapse = " ") %>%
    strsplit("\\s+") %>%
    unlist %>%
    matrix(ncol = length(.)) %>% # one row, one column per token
    as.data.frame(stringsAsFactors = FALSE)
}) %>%
  bind_rows %>%
  write_csv("whatever.csv")

You can now reimport the data with neat variable names and set the correct column types:

x <- read_csv("whatever.csv", skip = 1, # skip the header row written by write_csv
  col_names = c(
    # column names
  ),
  col_types = "cccciiii -- etc.")
Fr.
  • I'm kind of interested in that package. What I did was use Python to write something that parsed the files. When I looked around, nothing could handle strictly fixed-width data with no delimiters. I wonder if that package can; I'll check it out. – debo Mar 01 '17 at 05:01