6

I want to read in a CSV file whose first line is the variable names and whose subsequent lines are the values of those variables. Some of the variables are numeric, some are text, and some are even empty.

file = "path/file.csv"
f = file(file,'r')
varnames = strsplit(readLines(f,1),",")[[1]]
data = strsplit(readLines(f,1),",")[[1]]

Now that `data` contains the values of all the variables for one row, how do I get each field converted to the right data type, just as it would be if I had used read.csv?

I need to read the data line by line (or n lines at a time) as the whole dataset is too big to be read into R.
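Roughly speaking, I want the per-line equivalent of read.csv's type guessing. A minimal sketch of the kind of thing I mean, using `type.convert` (which read.csv relies on internally); the conversion step here is only illustrative:

# sketch: read one data line and convert each field to its guessed type
fields = strsplit(readLines(f, 1), ",")[[1]]
row = as.data.frame(lapply(fields, type.convert, as.is = TRUE))  # as.is = TRUE keeps text as character
names(row) = varnames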

Alan Moore
  • 73,866
  • 12
  • 100
  • 156
xiaodai
  • 14,889
  • 18
  • 76
  • 140
  • 1
    I'm unclear on why `read.csv('file.csv')` isn't working for you? – Greg May 25 '11 at 04:53
  • Say I am reading in a 100G file. I want to read 500MB at a time, do some processing and then read the next 500MB. I just can't fit 100G in memory. – xiaodai May 25 '11 at 04:55
  • 4
    The read.* family of functions allows you to read only so many lines and to skip blocks as well, so you ought to use those parameters rather than attempting to parse lines. – IRTFM May 25 '11 at 04:58
  • 1
    I agree with the comment of @xiaodai below, that what's really needed here hasn't yet been addressed because the proposed functions are super slow: to read one line at a time in a file with N lines will take $O(N^2)$ time. If one is willing to write some low-level code, and if R has random-access files, it should be possible to get it down to $O(N)$ simply by not losing the random-access position in between calls. – daveagp Nov 21 '17 at 18:36

7 Answers

11

Based on DWin's comment, you can try something like this:

read.clump <- function(file, lines, clump){
    if(clump > 1){
        # re-read the header row so later clumps keep the right column names
        header <- read.csv(file, nrows = 1, header = FALSE)
        # if 'file' is a file name rather than a textConnection, use
        # skip = lines*(clump-1) + 1 here, because the file is reopened
        # and the header line has to be skipped as well
        p <- read.csv(file, skip = lines*(clump-1),
                      nrows = lines, header = FALSE)
        names(p) <- unlist(header)
    } else {
        p <- read.csv(file, skip = lines*(clump-1), nrows = lines)
    }
    return(p)
}

You should probably add some error handling/checking to the function, too.

Then with

x = "letter1, letter2
a, b
c, d
e, f
g, h
i, j
k, l"


> read.clump(textConnection(x), lines = 2, clump = 1)
  letter1 letter2
1       a       b
2       c       d

> read.clump(textConnection(x), lines = 2, clump = 2)
  letter1  letter2
1       e        f
2       g        h

> read.clump(textConnection(x), lines = 3, clump = 1)
  letter1 letter2
1       a       b
2       c       d
3       e       f


> read.clump(textConnection(x), lines = 3, clump = 2)
  letter1  letter2
1       g        h
2       i        j
3       k        l

Now you just have to `*apply` over the clumps.
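For example, a minimal sketch (here `n.clumps` and `process()` are placeholders for the number of clumps in your file and your own per-chunk work, and the skip-one-extra-line caveat for plain files from the comments still applies):

# sketch: process each clump in turn and keep only the (small) per-clump results
results <- lapply(seq_len(n.clumps), function(i) {
    chunk <- read.clump("path/file.csv", lines = 1e6, clump = i)
    process(chunk)
})
final <- do.call(rbind, results)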

Greg
  • 11,564
  • 5
  • 41
  • 27
  • I think there is a small error in there: if you do not pass a textConnection (or another connection) but a filename, you need to skip one more line in the first `p = read.csv`, because then the file is reopened after reading the header line. – Nick Sabbe May 25 '11 at 09:31
  • 2
    I think this solution is good. But have you thought about the implications of using skip = ? To skip to the 1,000,001st record, say, R still reads through the first million records in the background! So it's really slow, even though it doesn't use much memory. Is there a way to read 1 million records, say, then stop, perform some actions on the data, discard it from memory, and then read the next 1 million records without having to read through the previous million (or more) first? – xiaodai May 30 '11 at 05:51
6

An alternative strategy, discussed here before for dealing with very big CSV files (say, more than ~1e7 cells), is to:

  1. Read the CSV file into an SQLite database.
  2. Import the data from the database with read.csv.sql from the sqldf package.

The main advantages of this are that it is usually quicker and you can easily filter the contents to only include the columns or rows that you need.

See "how to import CSV into sqlite using RSqlite?" for more info.
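A rough sketch of the sqldf route (the column name and filter are placeholders; inside the SQL the data from the file is referred to as `file`, and passing dbname = tempfile() keeps the intermediate SQLite database on disk rather than in memory):

library(sqldf)

# sketch: pull only the rows/columns you need straight from the CSV
dat <- read.csv.sql("path/file.csv",
                    sql    = "select * from file where mycol > 5",
                    dbname = tempfile())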

Richie Cotton
  • 118,240
  • 47
  • 247
  • 360
4

Just for fun (I'm waiting on a long-running computation here :-) ), here's a version that lets you use any of the read.* family of functions, and that also fixes a tiny error in @Greg's code:

read.clump <- function(file, lines, clump, readFunc = read.csv,
        # skip one extra line when reading from a plain file (not a connection),
        # because the header line is re-read on every call
        skip = lines*(clump-1) + ifelse(header & (clump > 1) & !inherits(file, "connection"), 1, 0),
        nrows = lines, header = TRUE, ...){
    if(clump > 1){
        colnms <- NULL
        if(header){
            # re-read the header row so later clumps keep the column names
            colnms <- unlist(readFunc(file, nrows = 1, header = FALSE))
            print(colnms)
        }
        p <- readFunc(file, skip = skip, nrows = nrows, header = FALSE, ...)
        if(!is.null(colnms)){
            colnames(p) <- colnms
        }
    } else {
        p <- readFunc(file, skip = skip, nrows = nrows, header = header)
    }
    return(p)
}

Now you can pass the relevant function as the readFunc parameter, and pass extra parameters through ... as well. Metaprogramming is fun.
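For example, a hypothetical call (the file name and arguments are placeholders):

# read the second clump of 1e5 lines from a tab-separated file,
# using read.delim instead of read.csv and passing an extra argument through ...
chunk2 <- read.clump("path/file.tsv", lines = 1e5, clump = 2,
                     readFunc = read.delim, stringsAsFactors = FALSE)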

Nick Sabbe
  • 11,684
  • 1
  • 43
  • 57
3

On a side note: if you really have data that huge, there are (besides the SQLite solution) several packages that will help you handle it without having to resort to tricks like those described in these answers.

There are the ff package and the bigmemory package, along with its companions biganalytics, bigtabulate, biglm and so on. For an overview, see e.g. their package documentation.

Joris Meys
  • 106,551
  • 31
  • 221
  • 263
1

I would try the LaF package:

Methods for fast access to large ASCII files... It is assumed that the files are too large to fit into memory... Methods are provided to access and process files blockwise. Furthermore, an opened file can be accessed as one would an ordinary data.frame...

I was able to get it working with the sample code below, and it seemed to have the performance you'd expect from a streaming implementation. However, I'd recommend running your own timing tests too.

library('LaF')

# detect the column types from the first 600 rows only
model <- detect_dm_csv('data.csv', header = TRUE, nrows = 600)

# open the file with the detected data model; nothing is read into memory yet
mylaf <- laf_open(model)

print(mylaf[1000, ])  # print the 1000th row, read directly from disk
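To actually process the file a block at a time (the use case in the question), LaF also provides begin() and next_block(); a rough sketch, with process() standing in for your own per-chunk work:

begin(mylaf)                                  # rewind to the start of the file
repeat {
    block <- next_block(mylaf, nrows = 1e6)   # read the next 1e6 rows as a data.frame
    if (nrow(block) == 0) break               # nothing left to read
    process(block)                            # placeholder for your own processing
}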
daveagp
  • 2,599
  • 2
  • 20
  • 19
1

I think using disk.frame's csv_to_disk.frame and setting in_chunk_size would be awesome for this use case, e.g.:

library(disk.frame)
csv_to_disk.frame("/path/to/file.csv", in_chunk_size = 1e7)
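Once converted, the chunks can be worked on with the dplyr verbs that disk.frame supports; a rough sketch (the column name is a placeholder):

library(dplyr)

df <- csv_to_disk.frame("/path/to/file.csv", in_chunk_size = 1e7)

# filter is applied chunk by chunk; collect() brings the (reduced) result into memory
res <- df %>%
    filter(some_col > 0) %>%
    collect()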
xiaodai
  • 14,889
  • 18
  • 76
  • 140
0

You can use chunked or disk.frame if you don't mind a little tinkering to write out your data.

Both have options that let you read the data chunk by chunk; a sketch with chunked follows.
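For illustration, a rough sketch with chunked (the dplyr verb, column name and output path are placeholders):

library(chunked)
library(dplyr)

# read 1e6 rows at a time, process each chunk, and write the result back to disk
read_csv_chunkwise("path/file.csv", chunk_size = 1e6) %>%
    filter(!is.na(x)) %>%
    write_chunkwise("path/out.csv")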

xiaodai
  • 14,889
  • 18
  • 76
  • 140