
So I have a large CSV file that my computer cannot open without RStudio terminating.

To solve this, I am trying to iterate through the rows of the file in order to do my calculations one row at a time, storing the result before moving on to the next row.

This I can normally achieve (e.g. on a smaller file) by simply reading and storing the whole CSV file within RStudio and running a simple for loop.
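
For reference, that whole-file approach looks something like this (the file name and the per-row calculation are placeholders; a throwaway file is created here so the sketch is self-contained):

```r
# Build a tiny stand-in for the real CSV file
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:3, b = 4:6), tmp, row.names = FALSE)

data <- read.csv(tmp)             # whole file in memory -- fine when small
results <- numeric(nrow(data))
for (i in seq_len(nrow(data))) {
  results[i] <- sum(data[i, ])    # placeholder per-row calculation
}
results  # 5 7 9
```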

It is, however, this storage of the whole dataset that I am trying to avoid, hence I am trying to read one row of the CSV file at a time instead.

(I think that makes sense)

This was suggested here.

I have managed to get my calculations to run quickly on the first row of my data file.

It is the looping over this that I am struggling with: I am trying to use a for loop (though it potentially should be a while/repeat statement), but I have nowhere for the "i" value to come from within the loop. Part of my code is below:

con = file(FileName, "r")
for (row in 1:nrow(con)) {
  data <- read.csv(con, nrow = 1) # reading of file
  "insert calculations here"
}

So "row" is never actually used, and the loop only goes through once. I also have an issue with the "1:nrow(con)", as nrow(con) simply returns NULL.
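
A quick sketch of the NULL issue, using a throwaway file: a connection is not a matrix-like object, so nrow() has nothing to count.

```r
tmp <- tempfile(fileext = ".csv")
writeLines(c("a,b", "1,2", "3,4"), tmp)

con <- file(tmp, "r")
print(class(con))  # "file" "connection"
print(nrow(con))   # NULL -- nrow() only works on matrix-like objects
close(con)
```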

Any help with this would be great, thanks.

Jaap
  • You have a couple of moving parts there that all need to be taken care of. For example: What does "large" mean? I'm not sure the process you use for reading the file is the most efficient one. There are ways to solve that but you should at least post a partial dataset. – Roman Oct 24 '18 at 15:10
  • Yes, it would be really helpful to know what the file looks like. Does it have column headings? If so, the reading is a bit harder. Do you know what types to read from each column? – user2554330 Oct 24 '18 at 16:37
  • The data is integers, 200000 rows and 3000 columns, with the first row having column headers only in the first 3 columns. This has caused some issues for me too, due to occasionally getting more columns than headers. – chocolatekeyboard Oct 25 '18 at 14:11
  • If you only have partial headers, you probably want to skip that row and ignore them. I'll modify my answer to show how to deal with this. – user2554330 Oct 25 '18 at 15:17

2 Answers


You could read in your data in batches of, say, 10,000 rows at a time (change n to read as many as you want), do your calculations, and then write the results to a new file, appending each batch to the end of that file.

Something like:

i = 0
n = 10000

# Read the header row once and reuse the names for every batch;
# otherwise read_csv would treat the first row of each batch as a header
col_names = names(readr::read_csv('my_file.csv', n_max = 0))

while (TRUE) {
    df = readr::read_csv('my_file.csv', skip = i + 1, n_max = n,
                         col_names = col_names)
    # If the number of rows in the file is divisible by n, it may be the case
    # that the next pass will result in an empty data frame being returned
    if (nrow(df) > 0) {
        # do your calculations
        # If you have performed calculations on df and want to save those results,
        # append them to a file to avoid overwriting prior results
        readr::write_csv(df, 'my_new_file.csv', append=TRUE)
    } else {
        break
    }

    # Check to see if we need to keep going; if so, add n to i
    if (nrow(df) < n) {
        break
    } else {
        i = i + n
    }
}
tblznbits
  • Yes, you can use `read.csv` with the parameters `skip` and `nrows`. But you should know that `read.csv` is **significantly** slower than `readr::read_csv`, so I would recommend against going that route unless you're unable to download packages on your computer. – tblznbits Oct 24 '18 at 15:29
  • Thanks, what is the purpose of the ` readr::write_csv(df, 'my_new_file.csv', append=TRUE) ` line? – chocolatekeyboard Oct 24 '18 at 15:44
  • That's for the event where you want to save your data after performing your calculations. – tblznbits Oct 24 '18 at 15:44
  • Is it not possible to save them within the workspace in RStudio, maybe in a vector? – chocolatekeyboard Oct 24 '18 at 15:52
  • Without seeing your data, I am forced to assume that the reason RStudio is crashing is because your data is too big to hold in memory. And if that's the case, then reading it in batches and trying to keep it in memory will not fix that problem. – tblznbits Oct 24 '18 at 15:54
  • My data is 200000 x 3000 and my computer is small. The data that I am trying to store would only be a 1 x 3000 vector. – chocolatekeyboard Oct 24 '18 at 16:07
  • The data you are trying to store while doing the calculations might only be a single row. But once you're done with those calculations, you don't want to keep that row in memory. But, if you don't have any need to save that row once your calculations have been performed, then you can exclude the `readr::write_csv` line. – tblznbits Oct 24 '18 at 16:09
  • Won't `readr::read_csv` assume there's a header row on every batch? – user2554330 Oct 25 '18 at 15:31
  • @user2554330 Likely, but you can suppress that with `col_names=FALSE`, which will result in the columns being labeled `x1` through `xn`. So probably best to have a vector of column names available to overwrite those names once read in. – tblznbits Oct 25 '18 at 15:35

read.csv() will generate an error if it tries to read past the end of the file. So you could do something like this:

con <- file(FileName, "rt")
repeat {
   data <- try(read.csv(con, nrows = 1, header = FALSE), silent = TRUE) # reading of file
   if (inherits(data, "try-error")) break
   "insert calculations here"
}
close(con)

It will be really slow going one line at a time, but you can do it in larger batches if your calculation code supports that. I'd also recommend specifying the column types using colClasses in the read.csv() call, so that R doesn't occasionally guess them differently.

Edited to add:

We've been told that there are 3000 columns of integers in the dataset. The first row only has partial header information. This code can deal with that:

n <- 1                           # desired batch size
col.names <- paste0("C", 1:3000) # desired column names
con <- file(FileName, "rt")
readLines(con, 1)                # Skip over bad header row
repeat {
   data <- try(read.csv(con, nrows = n, header = FALSE,
                        col.names = col.names,
                        colClasses = "integer"),
               silent = TRUE) # reading of file
   if (inherits(data, "try-error")) break
   "insert calculations here"
}
close(con)
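
As an illustration of the "insert calculations here" step, here is a sketch that accumulates per-column sums across batches, so only one batch is ever held in memory. The file, column count, and calculation are hypothetical stand-ins (3 columns instead of 3000):

```r
# Stand-in for the real file: a bad header row followed by integer rows
tmp <- tempfile(fileext = ".csv")
writeLines(c("h1,h2,h3",
             "1,2,3", "4,5,6", "7,8,9"), tmp)

n <- 2
col.names <- paste0("C", 1:3)
col.sums <- setNames(numeric(3), col.names)

con <- file(tmp, "rt")
readLines(con, 1)                 # skip the bad header row
repeat {
  data <- try(read.csv(con, nrows = n, header = FALSE,
                       col.names = col.names,
                       colClasses = "integer"),
              silent = TRUE)
  if (inherits(data, "try-error")) break
  col.sums <- col.sums + colSums(data)   # running per-column total
}
close(con)
col.sums  # C1 C2 C3 = 12 15 18
```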
user2554330
  • Thanks. Is there any way to read from the 2nd column onwards (e.g. ignoring the first column)? – chocolatekeyboard Oct 30 '18 at 12:15
  • If you put `row.names=1` in the `read.csv()` call, it will use the first column as row names. (There should be no repeats.) Otherwise, you can read using the code in the answer then drop the first column using `data <- data[-1]`. – user2554330 Oct 30 '18 at 13:51
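
A tiny sketch of the second option from that comment, with hypothetical file contents:

```r
tmp <- tempfile(fileext = ".csv")
writeLines(c("id1,10,20", "id2,30,40"), tmp)

data <- read.csv(tmp, header = FALSE)
data <- data[-1]   # drop the first column (the IDs)
print(data)        # only columns V2 and V3 remain
```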