I am trying to read a large (> 70 MB) fixed-format text file into R. For a smaller file (< 1 MB), I can use the read.fwf() function as shown below.

condodattest1a <- read.fwf(impfile1,widths=testcsv3$Varlen,col.names=testcsv3$Varname)

When I try to run the line of code below,

condodattest1 <- read.fwf(impfile,widths=testcsv3$Varlen,col.names=testcsv3$Varname)

I get the following error message:

Error: cannot allocate vector of size 2 Kb

The only difference between the 2 lines is the size of the input file.

The formatting for the file I want to import is given in the dataframe called testcsv3. I show a small snippet of the dataframe below:

> head(testcsv3)

  Varlen      Varname    Varclass Varsep Varforfmt
1      2         "V1" "character"      2    "A2.0"
2     15         "V2" "character"     17   "A15.0"
3     28         "V3" "character"     45   "A28.0"
4      3         "V4" "character"     48    "F3.0"
5      1         "V5" "character"     49    "A1.0"
6      3         "V6" "character"     52    "A3.0"

At least part of my problem is that I am reading in all the data as factors when I use read.fwf() and I end up exceeding the memory limit on my computer.
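
One way to keep the columns from becoming factors is to pass `colClasses` through `read.fwf()`, which forwards extra arguments to `read.table()`. The following is only a minimal sketch: the temp file and the two-column layout are made up, and `widths` and `cols` stand in for `testcsv3$Varlen` and `testcsv3$Varname`.

```r
# Hypothetical sample data standing in for the real fixed-width file
tmp <- tempfile(fileext = ".txt")
writeLines(c("AB123", "CD456", "EF789"), tmp)

widths <- c(2, 3)          # stand-in for testcsv3$Varlen
cols   <- c("V1", "V2")    # stand-in for testcsv3$Varname

# Extra arguments are handed on to read.table(), so colClasses
# keeps every column as plain character rather than factor
dat <- read.fwf(tmp, widths = widths, col.names = cols,
                colClasses = "character")
str(dat)
unlink(tmp)
```

In R versions before 4.0, `stringsAsFactors` defaulted to `TRUE`, so setting `colClasses` (or `stringsAsFactors = FALSE`) explicitly matters there.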

I tried to use read.table() as a way of formatting each variable but it seems I need a text delimiter with that function. There is a suggestion in section 3.3 in the link below that I could use sep to identify the column where every variable starts.

http://data.princeton.edu/R/readingData.html

However, when I use the command below:

condodattest1b <- read.table(impfile1,sep=testcsv3$Varsep,col.names=testcsv3$Varname, colClasses=testcsv3$Varclass)

I get the following error message:

Error in read.table(impfile1, sep = testcsv3$Varsep, col.names = testcsv3$Varname, : invalid 'sep' argument
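
The `sep` argument of `read.table()` expects a single delimiter character, not a vector of column positions, which would explain why the call is rejected. Judging from the snippet above, the `Varsep` values look like the running total of `Varlen`, that is, the last column of each field rather than the first. A small sketch using only the six widths shown in `head(testcsv3)`:

```r
varlen <- c(2, 15, 28, 3, 1, 3)     # Varlen values from head(testcsv3)

end_col   <- cumsum(varlen)         # last character position of each field
start_col <- end_col - varlen + 1   # first character position of each field

data.frame(varlen, start_col, end_col)
```

Here `end_col` reproduces the `Varsep` column for these six rows, so the widths alone are enough to locate every field.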

Finally, I tried to use:

condodattest1c <- read.fortran(impfile1,lengths=testcsv3$Varlen, format=testcsv3$Varforfmt, col.names=testcsv3$Varname)

but I get the following message:

Error in processFormat(format) : missing lengths for some fields
In addition: Warning messages:
1: In processFormat(format) : NAs introduced by coercion
2: In processFormat(format) : NAs introduced by coercion
3: In processFormat(format) : NAs introduced by coercion
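
The coercion warnings are consistent with format strings that `read.fortran()` cannot parse. The printout of `testcsv3` shows quote marks around the values, which suggests (this is an assumption, not something the output proves) that the quotes are stored as part of the strings. `read.fortran()` also takes the field widths from `format` itself, so a separate `lengths` argument should not be needed. A sketch with made-up data that strips the quotes, and the `.0` suffix that character (`A`) descriptors do not normally carry, before the call:

```r
# Hypothetical sample file with a 2-character field and a 3-digit field
tmp <- tempfile(fileext = ".txt")
writeLines(c("AB123", "CD456"), tmp)

# Hypothetical format column with embedded quote characters
fmt_raw <- c('"A2.0"', '"F3.0"')

fmt <- gsub('"', "", fmt_raw, fixed = TRUE)   # drop the literal quotes
fmt <- sub("^(A[0-9]+)\\.0$", "\\1", fmt)      # A2.0 becomes A2

dat <- read.fortran(tmp, format = fmt, col.names = c("V1", "V2"))
str(dat)
unlink(tmp)
```

The same `gsub()` cleanup would apply to `Varname` and `Varclass` if they carry embedded quotes as well.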

All I am trying to do at this point is format the data when they come into R as something other than factors. I am hoping this will limit the amount of memory I am using and allow me to actually input the file. I would appreciate any suggestions about how I can do this. I know the Fortran formats for all the variables and the column at which each variable begins.

Thank you,

Warren

  • Take a look at the [ff package](http://cran.r-project.org/web/packages/ff/index.html). Or maybe it is worth creating a database and accessing the data with RODBC – Barranka Feb 11 '14 at 21:37
  • Have a look at mnel's answer (most recent) in [here](http://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r) – crogg01 Feb 11 '14 at 21:48

1 Answer

Maybe this code works for you. Pass the field widths as varlen and the corresponding type strings (e.g. numeric, character, integer) as colclasses.

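my.readfwf <- function(filename, varlen, colclasses) {
  # start and end character positions of each field
  sidx <- cumsum(c(1, head(varlen, -1)))
  eidx <- sidx + varlen - 1
  # read the whole file, one string per line
  filecontent <- scan(filename, character(0), sep = "\n", quiet = TRUE)
  if (any(diff(nchar(filecontent)) != 0))
    stop("line lengths differ!")
  res <- list()
  for (i in seq_along(varlen)) {
    # substring() is vectorised over lines, so sapply() is not needed
    res[[i]] <- substring(filecontent, sidx[i], eidx[i])
    # convert to the requested type, e.g. "numeric", "character", "integer"
    mode(res[[i]]) <- colclasses[i]
  }
  # turn the list of columns into a data frame with names V1, V2, ...
  attributes(res) <- list(names = paste("V", seq_along(res), sep = ""),
                          row.names = seq_along(res[[1]]),
                          class = "data.frame")
  return(res)
}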
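
As a variation on the idea above (not part of the original function), the file can also be read through a connection in chunks, so that only a limited number of raw lines is held at a time. This is a rough sketch: `read_fwf_chunked` and `chunk_size` are hypothetical names, every column is left as character, and the final data frame still has to fit in memory.

```r
# Hypothetical helper: split fixed-width lines into columns, chunk by chunk
read_fwf_chunked <- function(filename, varlen, chunk_size = 10000L) {
  sidx <- cumsum(c(1, head(varlen, -1)))   # field start positions
  eidx <- sidx + varlen - 1                # field end positions
  con <- file(filename, open = "r")
  on.exit(close(con))
  pieces <- list()
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break
    # one character vector per field, vectorised over the lines in the chunk
    cols <- lapply(seq_along(varlen),
                   function(i) substring(lines, sidx[i], eidx[i]))
    names(cols) <- paste0("V", seq_along(varlen))
    pieces[[length(pieces) + 1L]] <-
      as.data.frame(cols, stringsAsFactors = FALSE)
  }
  do.call(rbind, pieces)
}

# Usage with made-up data: five lines, read two at a time
tmp <- tempfile(fileext = ".txt")
writeLines(c("AB123", "CD456", "EF789", "GH012", "IJ345"), tmp)
chunked <- read_fwf_chunked(tmp, varlen = c(2, 3), chunk_size = 2L)
unlink(tmp)
```

Columns can be converted afterwards, for example with `as.numeric()`, once it is clear which ones need it.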
Georg Schnabel