
How can I read big data formatted with fixed width? I read this question and tried some of its tips, but all the answers are for delimited data (such as .csv), which isn't my case. The file is 558 MB, and I don't know how many lines it has.

I'm using:

dados <- read.fwf('TS_MATRICULA_RS.txt', width=c(5, 13, 14, 3, 3, 5, 4, 6, 6, 6, 1, 1, 1, 4, 3, 2, 9, 3, 2, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 4, 11, 9, 2, 3, 9, 3, 2, 9, 9, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1), stringsAsFactors=FALSE, comment.char='', 
    colClasses=c('integer', 'integer', 'integer', 'integer', 'integer', 'integer', 'integer', 'integer', 'integer', 'integer', 'character', 'character', 'character',
    'integer', 'integer', 'character', 'integer', 'integer', 'character', 'integer', 'character', 'character', 'character', 'character', 'character', 'character',
    'character', 'character', 'character', 'character', 'character', 'character', 'character', 'character', 'character', 'character', 'character', 'character',
    'character', 'character', 'character', 'character', 'character', 'character', 'character', 'character', 'character', 'character', 'character', 'integer',
    'integer', 'integer', 'integer', 'integer', 'integer', 'integer', 'integer', 'character', 'integer', 'integer', 'character', 'character', 'character',
    'character', 'integer', 'character', 'character', 'character', 'character', 'character', 'character', 'character', 'character'), buffersize=180000)

But it takes 30 minutes (and counting...) to read the data. Any new suggestions?

Rcoster
  • I have no idea how to speed it up, but I also save huge files like that as an R object after they load, so I can load them much faster in the future. – dayne Sep 10 '13 at 13:22
  • Yes, me too. But all I need from this file (for now) is a simple table, and then I have to load another one just as big :/ – Rcoster Sep 10 '13 at 13:24
  • I just read in a 4GB FWF using the following strategy: Load on cluster using the "big ram" queue (30GB). Save as R object. Took all night. So it goes with big data. Hopefully someone has a more efficient strategy though. – Ari B. Friedman Sep 10 '13 at 13:25
  • If your source file's elements are "well-behaved" enough, perhaps you can do a global replace with some text editor to substitute "," for whitespace and then use `read.csv`? Another wild-ass guess: read the file into `R` with `readLines` and run `read.fwf` from `stdin`? I've never tried that, so I may be completely off base. – Carl Witthoft Sep 10 '13 at 13:46
  • After 1h waiting: `Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : scan() expected 'an integer', got '110021407656'` :((( – Rcoster Sep 10 '13 at 13:54
  • Can't you make a CSV of your file? If so, you should be able to load your data in 10-20 seconds tops. – statquant Sep 10 '13 at 14:00
  • Any suggestion on how to convert it to CSV before loading? It has more than 2 million lines, so Excel can't read it :/ (OK, I can use SPSS, it takes only 2 minutes, but yesterday I told my boss "Don't renew the SPSS license, it's useless, R can do everything and better!") – Rcoster Sep 10 '13 at 14:07
  • Yeah, let notepad++ do the conversion. Or SciTe, or just about any text editor, as both I and statquant suggested. – Carl Witthoft Sep 10 '13 at 14:29
  • Do I need a plugin to do that in Notepad++? – Rcoster Sep 10 '13 at 14:42
  • No, just open the file with Notepad++ (or vim) and convert it by replacing your spaces with `,`. BTW, just to check... your file is NOT tab-separated, right (otherwise it's already a CSV)? – statquant Sep 10 '13 at 14:47
  • Ahhhh! No no, not all variables have a space between them, so I can't just replace spaces with `,`. – Rcoster Sep 10 '13 at 15:16
  • You have an index of the widths? Try sqldf with substr. Or create a csvkit schema file and use csvkit to create your CSV, then read the CSV with fread from data.table. – A5C1D2H2I1M1N2O1R2T1 Sep 10 '13 at 15:25
  • @Rcoster I was a little surprised at the "just substitute the spaces" comments also, since "fixed width" has a pretty fixed definition. I wonder if there's some fast little UNIX utility that would insert a comma at each of a vector of spaces. – Ari B. Friedman Sep 10 '13 at 15:25
  • @AnandaMahto That looks like it would be worth an answer with benchmarks :-) – Ari B. Friedman Sep 10 '13 at 15:30
  • Trying something with `readLines()` and `substr()`... `readLines()` went ok, now let's see `substr()`... I'm using `mapply(function(x, start, stop) mapply(substr, x, start, stop), dados, MoreArgs=list(start=start, stop=stop))`, any other approach? – Rcoster Sep 10 '13 at 15:58
  • @AriB.Friedman `sed 's/ */,/g'` (there are 2 spaces in that expression; or `sed -r 's/ +/,/g'` with 1 space) – eddi Sep 10 '13 at 16:38
  • @AriB.Friedman, no benchmarks, but I've posted a potential answer. – A5C1D2H2I1M1N2O1R2T1 Sep 10 '13 at 18:12
  • @AnandaMahto Strong work. Surprised this question and the answers aren't getting more upvotes. – Ari B. Friedman Sep 11 '13 at 20:29
  • I'll try to update my answer later, but in the meantime, I wanted to share a package that you might be interested in: [iotools](https://github.com/s-u/iotools). – A5C1D2H2I1M1N2O1R2T1 Feb 05 '15 at 02:13

3 Answers


Without enough details about your data, it's hard to give a concrete answer, but here are some ideas to get you started:

First, if you're on a Unix system, you can get some information about your file with the wc command. For example, wc -l TS_MATRICULA_RS.txt will tell you how many lines are in your file, and wc -L TS_MATRICULA_RS.txt will report the length of the longest line. This might be useful to know. Similarly, head and tail let you inspect the first and last 10 lines of your text file.
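
If you'd rather stay inside R for these checks, the same commands can be run through system() (a small sketch; it assumes a Unix-like system with wc and head on the path):

system("wc -l TS_MATRICULA_RS.txt")   # number of lines
system("wc -L TS_MATRICULA_RS.txt")   # length of the longest line
system("head TS_MATRICULA_RS.txt")    # first 10 lines, to eyeball the layout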

Second, some suggestions: Since it appears that you know the widths of each field, I would recommend one of two approaches.

Option 1: csvkit + your favorite method to quickly read large data

csvkit is a set of Python tools for working with CSV files. One of the tools is in2csv, which takes a fixed-width-format file combined with a "schema" file to create a proper CSV that can be used with other programs.

The schema file is, itself, a CSV file with three columns: (1) variable name, (2) start position, and (3) width. An example (from the in2csv man page) is:

    column,start,length
    name,0,30 
    birthday,30,10 
    age,40,3

Once you have created that file, you should be able to use something like:

in2csv -f fixed -s path/to/schemafile.csv path/to/TS_MATRICULA_RS.txt > TS_MATRICULA_RS.csv

From there, I would suggest looking into reading the data with fread from "data.table" or using sqldf.
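
For instance (a hedged sketch, assuming the in2csv command above wrote TS_MATRICULA_RS.csv), the data.table route is then a one-liner:

library(data.table)
## read the converted CSV; fread detects the separator and column types
dados <- fread("TS_MATRICULA_RS.csv")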

Option 2: sqldf using substr

Using sqldf on a large-ish data file like yours should actually be pretty quick, and you get the benefit of being able to specify exactly what you want to read in using substr.

Again, this will expect that you have a schema file available, like the one described above. Once you have your schema file, you can do the following:

temp <- read.csv("mySchemaFile.csv")

## Construct your "substr" command
## NB: SQL's substr() is 1-based, while the in2csv schema example above
## uses 0-based start positions, so you may need temp$start + 1 here.
GetMe <- paste("select", 
               paste("substr(V1, ", temp$start, ", ",
                     temp$length, ") `", temp$column, "`", 
                     sep = "", collapse = ", "), 
               "from fixed", sep = " ")

## Load "sqldf"
library(sqldf)

## Connect to your file
fixed <- file("TS_MATRICULA_RS.txt")
myDF <- sqldf(GetMe, file.format = list(sep = "_"))

Since you know the widths, you might be able to skip the generation of the schema file. From the widths, it's just a little bit of work with cumsum. Here's a basic example, building on the first example from read.fwf:

ff <- tempfile()
cat(file = ff, "123456", "987654", sep = "\n")
read.fwf(ff, widths = c(1, 2, 3))

widths <- c(1, 2, 3)
ends <- cumsum(widths)             # end position of each field
start <- ends - widths + 1         # start position of each field
column <- paste("V", seq_along(ends), sep = "")

GetMe <- paste("select", 
               paste("substr(V1, ", start, ", ",
                     widths, ") `", column, "`", 
                     sep = "", collapse = ", "), 
               "from fixed", sep = " ")

library(sqldf)

## Connect to your file
fixed <- file(ff)
myDF <- sqldf(GetMe, file.format = list(sep = "_"))
myDF
unlink(ff)
A5C1D2H2I1M1N2O1R2T1
  • See [here](http://stackoverflow.com/questions/24715894/faster-way-to-read-fixed-width-files-in-r/34190156#34190156) for some benchmarks. I couldn't get the `sqldf` version to work (an error about there being no column named `V1`), so I excluded it for now. – MichaelChirico Dec 09 '15 at 22:24

The LaF package is pretty good at reading fixed-width files very fast. I use it daily to load files of roughly 100 million records with 30 columns (not as many character columns as you have; mainly numeric data and some factors), and it is pretty fast. So this is what I would do.

library(LaF)
library(ffbase)
## note: stringsAsFactors and comment.char are read.fwf arguments
## and don't apply to laf_open_fwf, so they are dropped here
my.data.laf <- laf_open_fwf('TS_MATRICULA_RS.txt', 
                  column_widths=c(5, 13, 14, 3, 3, 5, 4, 6, 6, 6, 1, 1, 1, 4, 3, 2, 9, 3, 2, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 4, 11, 9, 2, 3, 9, 3, 2, 9, 9, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1),
                  column_types=c('integer', 'integer', 'integer', 'integer', 'integer', 'integer', 'integer', 'integer', 'integer', 'integer', 'categorical', 'categorical', 'categorical',
                               'integer', 'integer', 'categorical', 'integer', 'integer', 'categorical', 'integer', 'categorical', 'categorical', 'categorical', 'categorical', 'categorical', 'categorical',
                               'categorical', 'categorical', 'categorical', 'categorical', 'categorical', 'categorical', 'categorical', 'categorical', 'categorical', 'categorical', 'categorical', 'categorical',
                               'categorical', 'categorical', 'categorical', 'categorical', 'categorical', 'categorical', 'categorical', 'categorical', 'categorical', 'categorical', 'categorical', 'integer',
                               'integer', 'integer', 'integer', 'integer', 'integer', 'integer', 'integer', 'categorical', 'integer', 'integer', 'categorical', 'categorical', 'categorical',
                               'categorical', 'integer', 'categorical', 'categorical', 'categorical', 'categorical', 'categorical', 'categorical', 'categorical', 'categorical'))
my.data <- laf_to_ffdf(my.data.laf, nrows=1000000)
my.data.in.ram <- as.data.frame(my.data)
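
Since the question's comments note that only a simple table is needed from this file, it's worth adding that a laf object can also be indexed like a data.frame, so selected columns can be pulled without materializing the rest. A minimal sketch (the column indices are purely illustrative):

## read only a few columns straight from the LaF connection;
## the indices here are illustrative, not the real layout
few.columns <- my.data.laf[, c(1, 2, 17)]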

PS. I started using the LaF package because I was annoyed by the slowness of read.fwf, and because the PL/SQL PostgreSQL code I was initially using to parse the data was becoming a hassle to maintain.

  • For some reason it reads the 3rd variable incorrectly. `" 2012 8874432 110021407656 16 2 2004 8 240 180 0M11 76 43RS 4317400 43RS 4317400130 0000000000000000000000000 1 16 571764 0 0 43132715 43RS 4318002 512 00000100"` becomes `-1647742040` instead of `110021407656`. Any idea? – Rcoster Sep 10 '13 at 18:55
  • 110021407656 is not an integer. See what as.integer(110021407656) gives in R, and what as.double(110021407656) gives you. Adjust the column type of that column to 'double' in the code above. – Sep 10 '13 at 19:03
  • @Rcoster hmmmm.... subtract one from the other and you get 2^33*13. Any chance you're running into a 2^32 limit here? (Which is to say, you definitely are.) – Carl Witthoft Sep 10 '13 at 19:09
  • Tried this but it crashes (RStudio/R 3.0.0) – Ari B. Friedman Sep 10 '13 at 20:10
  • @Ari Are you using the exact code on the data of Rcoster or do you have your own code and data which seems to be causing you troubles? In the latter case, please share the code & data. –  Sep 11 '13 at 06:26
  • @jwijffels Have my own code/data, but can't share even a piece of the data due to Federal regulations. I have other ways of getting this data in, and I suspect this is an RStudio bug or something anyhow, so I've moved on. – Ari B. Friedman Sep 11 '13 at 14:16
  • Tried it in R from the console. Got " *** caught segfault *** address 0x1040c4bdf, cause 'memory not mapped'". – Ari B. Friedman Sep 11 '13 at 20:23
  • Can you make a reproducible example of this. –  Sep 11 '13 at 20:27

Here is a pure R solution using the new package readr, created by Hadley Wickham and the RStudio team, released in April 2015. More info here. The code is as simple as this:

library(readr)

my.data.frame <- read_fwf('TS_MATRICULA_RS.txt',
                      fwf_widths(c(5, 13, 14, 3, 3, 5, 4, 6, 6, 6, 1, 1, 1, 4, 3, 2, 9, 3, 2, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 4, 11, 9, 2, 3, 9, 3, 2, 9, 9, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1)),
                      progress = interactive())

Advantages of read_fwf{readr}

  • readr is based on LaF but is surprisingly faster. It has been shown to be the fastest method to read fixed-width files in R
  • It's simpler than the alternatives; e.g., you don't need to worry about column_types, because they are inferred from the first 30 rows of the input (see the sketch after this list for overriding them)
  • It comes with a progress bar ;)
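
If the guessed types come out wrong (for instance, the 12-digit IDs discussed in the comments above overflow 'integer'), col_types can be spelled out, and fwf_positions() lets you read only the fields you need. A hedged sketch: the positions match only the first three widths from the question, and the column names are made up:

library(readr)
## columns at positions 1-5, 6-18 and 19-32 (widths 5, 13 and 14);
## 'n' reads the wide ID fields as numbers instead of integers
subset.df <- read_fwf('TS_MATRICULA_RS.txt',
                      fwf_positions(start = c(1, 6, 19),
                                    end = c(5, 18, 32),
                                    col_names = c("v1", "id1", "id2")),
                      col_types = "inn")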
rafa.pereira
  • Reminder: read_fwf() trims all character columns, and this feature cannot be turned off. – Feng Jiang Feb 22 '17 at 20:47
  • What exactly do you mean by 'trim all the character columns'? – rafa.pereira Feb 22 '17 at 21:35
  • It will remove whitespace from both the left and right sides of all character columns. LaF can turn it off: https://www.rdocumentation.org/packages/LaF/versions/0.6.3/topics/laf_open_fwf . read_delim in readr can also turn it off: https://www.rdocumentation.org/packages/readr/versions/1.0.0/topics/read_delim – Feng Jiang Feb 22 '17 at 22:07