I am currently working in SAS , but I have used R for a very long time. I have some fixed width text files to read. Those read easily in SAS but I am literally experiencing hell to do the same in R. The File looks some what like this :
DP JAMES SILVA REY
2014
6
0
1723713652
2
0
DP ALEJANDRA NARVAEZ
2014
6
0
1723713456
6
0
DP NANYER PICHARDO
2014
6
0
1723713991
1
0
DP GABRIELA ANASI CASTILLO
2014
6
0
1723713240
3
0
It is not clear here , I have attached it , please find.
It reads easily in SAS using infile input.
SAS Code:
infile "filename.txt" lrecl=32767 ;
input
@001 park_cd $5.
@006 Title $15.
@021 first_name $25.
@046 middle_name $25.
@071 last_name $25.
@096 suffix $15.
@111 ADDRESS_1 $60.
@171 ADDRESS_2 $60.
@231 ADDRESS_3 $60.
@261 CITY $30.
@291 STATE_PROVINCE $2.
@293 ZIP $9.
@302 Ticket_Year $11.
@314 product_id $12.
@327 UNIT_PRICE $13.
@340 PURCHASE_DT $26.
@366 PURCHASE_QTY $12.
@378 TOTAL_PURCHASE_AMT $14. ;
run;
Now to do the same in R , I have been trying various things:
1)Atfirst read.fwf , Code:
dat1=read.fwf("D:/Cedar_response/Cedar_Fair_DP_2014_haunt_results.txt",
widths=c(5,15,25,25,25,15,60,60,60,30,2,9,11,12,13,26,12,14),
header = FALSE, sep = "\t",fill = TRUE,
skip = 0, col.names=c("park_cd","Title","first_name","middle_name","last_name","suffix",
"ADDRESS_1 ","ADDRESS_2","ADDRESS_3","CITY","STATE_PROVINCE",
" ZIP","Ticket_Year","product_id","UNIT_PRICE","PURCHASE_DT",
"PURCHASE_QTY","TOTAL_PURCHASE_AMT "), fileEncoding = "ASCII")
But it returns just NA values for most of the fields and only some values that too in wrong positions.
Head(dat1) gives output:
park_cd Title first_name middle_name
1 DP JAMES SILVA
2
3 <NA>
4 <NA> <NA> <NA>
5 <NA> <NA>
6 2014 <NA> <NA>
last_name suffix
1 REY
2 <NA> <NA>
3 <NA> <NA>
4 <NA> <NA>
5 <NA> <NA>
6 <NA> <NA>
ADDRESS_1.
1
2 <NA>
3 <NA>
4 <NA>
5 <NA>
6 <NA>
ADDRESS_2 ADDRESS_3 CITY
1 NA NA
2 <NA> NA NA
3 <NA> NA NA
4 <NA> NA NA
5 <NA> NA NA
6 <NA> NA NA
STATE_PROVINCE X.ZIP Ticket_Year product_id UNIT_PRICE PURCHASE_DT PURCHASE_QTY
1 NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA
6 NA NA NA NA NA NA NA
TOTAL_PURCHASE_AMT.
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
Output:
2) Now I use the Sascii package to call the SAS code in R. Code:
sas_imp <- "input
@001 park_cd $5.
@006 Title $15.
@021 first_name $25.
@046 middle_name $25.
@071 last_name $25.
@096 suffix $15.
@111 ADDRESS_1 $60.
@171 ADDRESS_2 $60.
@231 ADDRESS_3 $60.
@261 CITY $30.
@291 STATE_PROVINCE $2.
@293 ZIP $9.
@302 Ticket_Year $11.
@314 product_id $12.
@327 UNIT_PRICE $13.
@340 PURCHASE_DT $26.
@366 PURCHASE_QTY $12.
@378 TOTAL_PURCHASE_AMT $14. ;"
sas_imp.tf <- tempfile()
writeLines (sas_imp , con = sas_imp.tf )
parse.SAScii( sas_imp.tf )
read.SAScii( "filename.txt" , sas_imp.tf )
It too gives the same useless output as above.
3) Now I use the Laf package and the laf_open_fwf command like :
library(LaF)
data <- laf_open_fwf(filename="D:/Cedar_response/Cedar_Fair_DP_2014_haunt_results.txt",
column_types=rep("character",18),
column_names=c("park_cd","Title","first_name","middle_name","last_name","suffix",
"ADDRESS_1 ","ADDRESS_2","ADDRESS_3","CITY","STATE_PROVINCE",
" ZIP","Ticket_Year","product_id","UNIT_PRICE","PURCHASE_DT",
"PURCHASE_QTY","TOTAL_PURCHASE_AMT "),
column_widths=c(5,15,25,25,25,15,60,60,60,30,2,9,11,12,13,26,12,14))
Then I convert it like :
library(ffbase)
my.data <- laf_to_ffdf(data)
head(as.data.frame(my.data))
But it gives output:
park_cd Title first_name middle_name last_name
1 DP JAMES SILVA REY
2 \r\n \r\n
3 JANDR A NARVAEZ
4 \r\n \r \n \r\n \r\n 20
5 PICHARDO
6 \r\n \r\n \r\n \r\n 2014\r\n 6\r\n
suffix
1
2 \r\n \r\n
3
4 14\r\n
5
6 0\r\n
ADDRESS_1.
1
2 2014\r\n 6\r\n 0\r\n 172
3
4 6\r\n 0\r\n 1723713456\r\n 6\r\n
5
6 1723713991\r\n 1\r\n 0\r\nDP
ADDRESS_2 ADDRESS_3 CITY
1 \r *\003
2 3713652\r\n 2\r\n 0\r\nDP A L *\003
3 \r\n *\003
4 0\r\nDP NANYER *\003
5 \r\n *\003
6 GABRIELA ANASI *\003
STATE_PROVINCE X.ZIP Ticket_Year product_id UNIT_PRICE PURCHASE_DT PURCHASE_QTY
1 ÐÆ *\003 "ADDR ,"\001 *\003 \n <N
2 ÐÆ *\003 "ADDR ,"\001 *\003 \n <N
3 ÐÆ *\003 "ADDR ,"\001 *\003 \n <N
4 ÐÆ *\003 "ADDR ,"\001 *\003 \n <N
5 ÐÆ *\003 "ADDR ,"\001 *\003 \n <N
6 ÐÆ *\003 "ADDR ,"\001 *\003 \n <N
TOTAL_PURCHASE_AMT.
1 \001
2 \001
3 \001
4 \001
5 \001
6 \001
4) Lastly read.table.ffdf like
library(ff)
library(stringr)
my.data1 <- read.table.ffdf(file="D:/Cedar_response/Cedar_Fair_DP_2014_haunt_results.txt",
FUN="read.fwf",
widths = c(5,15,25,25,25,15,60,60,60,30,2,9,11,12,13,26,12,14),
header=F, VERBOSE=TRUE,
col.names = c("park_cd","Title","first_name","middle_name","last_name","suffix",
"ADDRESS_1 ","ADDRESS_2","ADDRESS_3","CITY","STATE_PROVINCE",
" ZIP","Ticket_Year","product_id","UNIT_PRICE","PURCHASE_DT",
"PURCHASE_QTY","TOTAL_PURCHASE_AMT "),
fileEncoding = "UTF-8",
transFUN=function(x){
z <- sapply(x, function(y) {
y <- str_trim(y)
y[y==""] <- NA
factor(y)})
as.data.frame(z)
} )
But result is same. The last solution I found in this page [http://r.789695.n4.nabble.com/read-table-ffdf-and-fixed-width-files-td4673220.html][1].
What am I doing wrong , am I putting the widths wrong ? Or I am getting the ideas wrong altogether? I have worked with many things in R and just cant believe that a so easy thing in SAS is so tough in R. I must be missing some easy things. If you have any idea on these type please help me folks .Thanks in advance.