3

I am currently working in SAS , but I have used R for a very long time. I have some fixed width text files to read. Those read easily in SAS but I am literally experiencing hell to do the same in R. The File looks some what like this :

    DP                  JAMES                    SILVA                    REY                                                                                                                                                             




       2014
           6
            0
                1723713652
           2
             0
DP                  ALEJANDRA                                         NARVAEZ                                                                                                                                                         




       2014
           6
            0
                1723713456
           6
             0
DP                  NANYER                                            PICHARDO                                                                                                                                                        




       2014
           6
            0
                1723713991
           1
             0
DP                  GABRIELA                 ANASI                    CASTILLO                                                                                                                                                        




       2014
           6
            0
                1723713240
           3
             0

It is not clear here , I have attached it , please find.

It reads easily in SAS using infile input.

SAS Code:

infile  "filename.txt" lrecl=32767 ;

input

@001 park_cd      $5.

@006 Title $15.

@021 first_name  $25.

@046 middle_name $25.

@071 last_name   $25.

@096 suffix      $15.

@111 ADDRESS_1   $60.

@171 ADDRESS_2   $60.

@231 ADDRESS_3   $60.

@261 CITY  $30.

@291 STATE_PROVINCE    $2.

@293  ZIP    $9.

@302 Ticket_Year  $11.

@314 product_id  $12.

@327 UNIT_PRICE  $13.

@340 PURCHASE_DT $26.

@366 PURCHASE_QTY $12.

@378 TOTAL_PURCHASE_AMT $14. ;

run;

Now to do the same in R , I have been trying various things:

1)Atfirst read.fwf , Code:

dat1=read.fwf("D:/Cedar_response/Cedar_Fair_DP_2014_haunt_results.txt", 
 widths=c(5,15,25,25,25,15,60,60,60,30,2,9,11,12,13,26,12,14), 
header = FALSE, sep = "\t",fill = TRUE,
 skip = 0, col.names=c("park_cd","Title","first_name","middle_name","last_name","suffix",
   "ADDRESS_1 ","ADDRESS_2","ADDRESS_3","CITY","STATE_PROVINCE",
    " ZIP","Ticket_Year","product_id","UNIT_PRICE","PURCHASE_DT",
       "PURCHASE_QTY","TOTAL_PURCHASE_AMT "), fileEncoding = "ASCII")

But it returns just NA values for most of the fields and only some values that too in wrong positions.

Head(dat1) gives output:

  park_cd           Title                first_name               middle_name
1   DP                    JAMES                     SILVA                    
2                                                                            
3                                                                        <NA>
4                    <NA>                      <NA>                      <NA>
5                                              <NA>                      <NA>
6                    2014                      <NA>                      <NA>
                  last_name          suffix
1 REY                                      
2                      <NA>            <NA>
3                      <NA>            <NA>
4                      <NA>            <NA>
5                      <NA>            <NA>
6                      <NA>            <NA>
                                                    ADDRESS_1.
1                                                             
2                                                         <NA>
3                                                         <NA>
4                                                         <NA>
5                                                         <NA>
6                                                         <NA>
                                                     ADDRESS_2 ADDRESS_3 CITY
1                                                                     NA   NA
2                                                         <NA>        NA   NA
3                                                         <NA>        NA   NA
4                                                         <NA>        NA   NA
5                                                         <NA>        NA   NA
6                                                         <NA>        NA   NA
  STATE_PROVINCE X.ZIP Ticket_Year product_id UNIT_PRICE PURCHASE_DT PURCHASE_QTY
1             NA    NA          NA         NA         NA          NA           NA
2             NA    NA          NA         NA         NA          NA           NA
3             NA    NA          NA         NA         NA          NA           NA
4             NA    NA          NA         NA         NA          NA           NA
5             NA    NA          NA         NA         NA          NA           NA
6             NA    NA          NA         NA         NA          NA           NA
  TOTAL_PURCHASE_AMT.
1                  NA
2                  NA
3                  NA
4                  NA
5                  NA
6                  NA

Output:

2) Now I use the Sascii package to call the SAS code in R. Code:

sas_imp <- "input
@001 park_cd      $5.
@006 Title $15.
@021 first_name  $25.
@046 middle_name $25.
@071 last_name   $25.
@096 suffix      $15.
@111 ADDRESS_1   $60.
@171 ADDRESS_2   $60.
@231 ADDRESS_3   $60.
@261 CITY  $30.
@291 STATE_PROVINCE    $2.
@293  ZIP    $9.
@302 Ticket_Year  $11.
@314 product_id  $12.
@327 UNIT_PRICE  $13.
@340 PURCHASE_DT $26.
@366 PURCHASE_QTY $12.
@378 TOTAL_PURCHASE_AMT $14. ;"
sas_imp.tf <- tempfile()
writeLines (sas_imp , con = sas_imp.tf )
parse.SAScii( sas_imp.tf )
read.SAScii( "filename.txt" , sas_imp.tf ) 

It too gives the same useless output as above.

3) Now I use the Laf package and the laf_open_fwf command like :

library(LaF)

data <- laf_open_fwf(filename="D:/Cedar_response/Cedar_Fair_DP_2014_haunt_results.txt",
                     column_types=rep("character",18),
  column_names=c("park_cd","Title","first_name","middle_name","last_name","suffix",
                  "ADDRESS_1 ","ADDRESS_2","ADDRESS_3","CITY","STATE_PROVINCE",
                  " ZIP","Ticket_Year","product_id","UNIT_PRICE","PURCHASE_DT",
                  "PURCHASE_QTY","TOTAL_PURCHASE_AMT "),
                     column_widths=c(5,15,25,25,25,15,60,60,60,30,2,9,11,12,13,26,12,14))

Then I convert it like :

library(ffbase)
my.data <- laf_to_ffdf(data) 
head(as.data.frame(my.data))

But it gives output:

park_cd Title first_name             middle_name                      last_name
1      DP            JAMES                   SILVA                            REY
2                                             \r\n                           \r\n
3   JANDR     A                            NARVAEZ                               
4                     \r\n                      \r \n  \r\n         \r\n       20
5                 PICHARDO                                                       
6          \r\n            \r\n  \r\n         \r\n       2014\r\n           6\r\n
             suffix
1                  
2 \r\n         \r\n
3                  
4            14\r\n
5                  
6             0\r\n
                                                            ADDRESS_1.
1                                                                     
2         2014\r\n           6\r\n            0\r\n                172
3                                                                     
4 6\r\n            0\r\n                1723713456\r\n           6\r\n
5                                                                     
6                   1723713991\r\n           1\r\n             0\r\nDP
                                                           ADDRESS_2 ADDRESS_3  CITY
1                                                                           \r *\003
2 3713652\r\n           2\r\n             0\r\nDP                  A         L *\003
3                                                               \r\n           *\003
4                                    0\r\nDP                  NANYER           *\003
5                                                               \r\n           *\003
6                                     GABRIELA                 ANASI           *\003
  STATE_PROVINCE X.ZIP Ticket_Year product_id UNIT_PRICE PURCHASE_DT PURCHASE_QTY
1             ÐÆ *\003       "ADDR     ,"\001      *\003          \n           <N
2             ÐÆ *\003       "ADDR     ,"\001      *\003          \n           <N
3             ÐÆ *\003       "ADDR     ,"\001      *\003          \n           <N
4             ÐÆ *\003       "ADDR     ,"\001      *\003          \n           <N
5             ÐÆ *\003       "ADDR     ,"\001      *\003          \n           <N
6             ÐÆ *\003       "ADDR     ,"\001      *\003          \n           <N
  TOTAL_PURCHASE_AMT.
1                \001
2                \001
3                \001
4                \001
5                \001
6                \001

4) Lastly read.table.ffdf like

library(ff) 
library(stringr) 
my.data1  <- read.table.ffdf(file="D:/Cedar_response/Cedar_Fair_DP_2014_haunt_results.txt", 
                            FUN="read.fwf", 
                            widths = c(5,15,25,25,25,15,60,60,60,30,2,9,11,12,13,26,12,14), 
                            header=F, VERBOSE=TRUE, 
                            col.names = c("park_cd","Title","first_name","middle_name","last_name","suffix",
                                          "ADDRESS_1 ","ADDRESS_2","ADDRESS_3","CITY","STATE_PROVINCE",
                                          " ZIP","Ticket_Year","product_id","UNIT_PRICE","PURCHASE_DT",
                                          "PURCHASE_QTY","TOTAL_PURCHASE_AMT "), 
                            fileEncoding = "UTF-8", 
                            transFUN=function(x){ 
                              z <- sapply(x, function(y) { 
                                y <- str_trim(y) 
                                y[y==""] <- NA 
                                factor(y)}) 
                              as.data.frame(z) 
                            } )

But result is same. The last solution I found in this page [http://r.789695.n4.nabble.com/read-table-ffdf-and-fixed-width-files-td4673220.html][1].

What am I doing wrong , am I putting the widths wrong ? Or I am getting the ideas wrong altogether? I have worked with many things in R and just cant believe that a so easy thing in SAS is so tough in R. I must be missing some easy things. If you have any idea on these type please help me folks .Thanks in advance.

Joe
  • 62,789
  • 6
  • 49
  • 67
Sayak
  • 183
  • 1
  • 11
  • Sorry I couldn't attach . Has added a bit portion of the file. The output data in SAS has 18 columns and 32066 rows. – Sayak Feb 19 '15 at 14:16
  • 1
    When I copy the file portion you posted above into a text editor (notepad++) I find it doesn't correspond to the column widths in the code. Many lines are less than 20 characters long. What happens when you try to open it in excel as a fixed width file ? Perhaps you can post the file itself or part of it, to site where we can access it ? – Robert Long Feb 19 '15 at 15:09
  • As an aside, don't necessarily be surprised when things are easier/tougher in different languages. SAS and R have very different backgrounds, and are focused on very different things. SAS has a mainframe background that worked originally and natively with fixed width files; while if I understand correctly R also originated from a language used on mainframes, it seems to have migrated away from the fixed width to focus more on delimited files. – Joe Feb 19 '15 at 15:18
  • The sep argument of read.fwf says: "character; the separator used internally; should be a character that does not occur in the file (except in the header)." Is this the case in your file? –  Feb 19 '15 at 16:24
  • @RobertLong , I got the widths from the infile statement in SAS, not sure about them, Should they be fixed in both R & SAS? It is occuring to me also that widths might have been wrong. But how can I get the right widths? You have any idea? – Sayak Feb 19 '15 at 19:03
  • You sample data seems to have a bunch of extra line breaks. Is that true of your data file? Are you trying to import the exact same file in both R and SAS? Or have you copied to another machine? Are you sure it's identical? Your sample data also seems to have tabs rather than spaces between columns. Is this what you expect? – MrFlick Feb 19 '15 at 20:25
  • Yeah , file can be in many formats , this sample is not self-explanatory. So ,I have uplaoded the file to a location, ayou can download and work with it, the location is : [link](http://www.filedropper.com/cedarfairdp2014hauntresults). Thanks. – Sayak Feb 20 '15 at 05:39
  • fixed width files are the worst. eventually I wrote a routine in python to convert FW to csv; it takes 6 seconds & all troubles are washed away – MichaelChirico Feb 20 '15 at 13:28
  • @MichaelChirico , thats great, if you can help me with the logic or the code maybe , I can do it in python and then read in r. – Sayak Feb 20 '15 at 13:51
  • I'll pass the code later today. – MichaelChirico Feb 20 '15 at 14:03
  • Thanx man. Will help a lot – Sayak Feb 20 '15 at 14:20
  • @sayak you say you got the widths from the SAS code, but where did the SAS code come from? That code does not run successfully against the file you uploaded (I just asked a friend to try it). So you must have uploaded a different version of the file to that which was used in SAS – Robert Long Feb 20 '15 at 16:42

2 Answers2

2

Update

Please see here for what I use nowadays for this problem:

Faster way to read fixed-width files

for posterity, original answer retained below as a how-to guide for bootstrapping solutions while desperate


Here's the FW -> .csv converter I created in Python to destroy these awful files:

It also includes the checkLength function that can help get at what @RobertLong suggested, which is that your underlying file might be faulty. If that's the case, you may be in trouble if it's pervasive & unpredictable (i.e. there are no consistent mistake patterns in your file that you can ctrl+H to fix).

Please note dictfile must be formatted correctly (I wrote this for myself, not necessarily to be as robust as possible)

import os
import csv
#Set correct directory
os.chdir('/home/michael/...') #match format of your OS

def checkLength(ffile):
    """
    Used to check that all lines in file have the same length (and so don't cause any issues below)
    """
    with open(ffile,'r') as ff:
        firstrow=1
        troubles=0
        for rows in ff:
            if firstrow:
                length=len(rows)
                firstrow=0
            elif len(rows) != length:
                print rows
                print len(rows)
                troubles=1
    return troubles

def fixed2csv(infile,outfile,dictfile):
    """
    This function takes a file name for a fixed-width dataset as input and 
    converts it to .csv format according to slices and column names specified in dictfile

    Parameters
    ==========
        infile: string of input file name from which fixed-width data is to be read
                e.g. 'fixed_width.dat'
        outfile: string of output file name to which comma-separated data is to be saved
                 e.g. 'comma_separated.csv'
        dictfile: .csv-formatted dictionary file name from which to read the following:
                     * widths: field widths
                     * column names: names of columns to be written to the output .csv
                     * types: object types (character, integer, etc)
                  column order must be: col_names,slices,types
    """
    with open(dictfile,'r') as dictf:
        fieldnames = ("col_names","widths","types") #types used in R later
        ddict = csv.DictReader(dictf,fieldnames)
        slices=[]
        colNames=[]
        wwidths=[]
        for rows in ddict:
            wwidths.append(int(rows['widths'])) #Python 0-based, must subtract 1
            colNames.append(rows['col_names'])
        offset = 0
        for w in wwidths:
            slices.append(slice(offset,offset+w))
            offset+=w
    with open(infile,'r') as fixedf:
        with open(outfile,'w') as csvf:
            csvfile=csv.writer(csvf)
            csvfile.writerow(colNames)
            for rows in fixedf:
                csvfile.writerow([rows[s] for s in slices])

Good luck, and curses be on whoever it is that is proliferating these FW format data files.

MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
  • I'm not sure this will work since the file is actually variable width – Robert Long Feb 20 '15 at 15:55
  • your suggestion to get rid of all of those `\n`s is the best way to get the file into proper FW format. I'm still in favor of cracking it to a delimited format thereafter--FW is a pain. – MichaelChirico Feb 20 '15 at 16:19
  • I don't like FW files either, but it seems to require a similar amount of effort (specifying column names and widths) in running your code as it does with importing a proper FW file in R. I would probably import it in excel rather than R since at least there is a wizard to use. – Robert Long Feb 20 '15 at 17:09
  • If it's a one-off thing, I agree. I've found that having the .csv around for future use is a good option; I wrote this code for a series of FWF I found, and originally used this to convert 20 or so files in a few seconds' time. – MichaelChirico Feb 20 '15 at 18:29
  • Thanx michael, I will work with it and let you know my issues. – Sayak Feb 21 '15 at 05:27
1

The file you uploaded is not a fixed width file:

enter image description here

I am not a SAS user, but from looking at the SAS code in your post, the column widths in the code do not match up with those in the file.

Moreover, some lines are entirely blank.

It appears that there are many carriage return / new lines which do not belong there - in particular they seem to be used in places as a delimiter. There should be one CRLF at the end of each line, and that's it.

Since you say that SAS opens it, I suggest you use save to a CSV format in SAS and then open it in R. Alternatively you could remove the superfluous CRLF using a good text editor/processor, leaving a single CRLF at the end of each line. Since it appears that each "real" line begins with "DP" you could try to do a replace of -CRLF-DP with (say) -tab- then delete all -CRLF-s then replace all -tab-s with -CRLF- (this relies on their being no -tab-s in the file already)

Robert Long
  • 5,722
  • 5
  • 29
  • 50
  • But SAS reds it nicely, without any problem. That CRLF does not appear too when I opened in notepad. Actually It is to be done in R. Cant use SAS, so any suggestions on that please?? – Sayak Feb 20 '15 at 13:53
  • 2
    @sayak just remove the excess carriage return line feeds as I suggested. Notepad++ can do it. Just choose "Search->Replace", select Extended mode, put "\r\n" as the text to search for (the escape code for the carriage return new line) and "" (ie nothing) to replace it with. Then replace all occurances of "DP" with "\r\nDP". This assumes the only occurances of DP are at the start of every line. I just did this myself and it works. Takes a few seconds. – Robert Long Feb 20 '15 at 15:59
  • Hi Robert its a long time. The requirement was not there at that time. I have been working with this now. So I did this thing in notepad++ and the file appears good. I read it with read.table it reads but not in proper alignment with the column names. Can you please tell how did it work for you ? What was the code you used after changing in notepad++? – Sayak Jul 07 '15 at 07:26