
Please help me, as I am new to R and to programming in general.

I am trying to write a loop that reads the data 1000 rows at a time and creates a data set in R for each chunk.

Following is my attempt:

for(i in 0:nl){
  df[i] = fread('RM.csv',skip = 1000*i, nrows =1000,
                col.names = colnames(read.csv('RM.csv', nrow=1, header = T)))
}

where nl is an integer equal to the length of the data in 'RM.csv'.

What I am trying to do is create a function which reads the next 1000 rows on each pass, skipping the rows already read, and terminates once it reaches nl, the length of the original data.

It is not mandatory to use only this approach.

  • Possible duplicate of [Strategies for reading in CSV files in pieces?](https://stackoverflow.com/questions/9352887/strategies-for-reading-in-csv-files-in-pieces) – Arkadii Kuznetsov Aug 07 '17 at 09:30

1 Answer


You can try reading the entire file into a single data frame and then subsetting off the rows you don't want:

df <- read.csv('RM.csv', header=TRUE)
y <- seq(from = 0, to = 100000, by = 1)     # replace the 'to' value with a value
seq.keep <- y[floor(y / 1000) %% 2 == 0]    # large enough for the whole file
df.keep <- df[seq.keep, ]

Here is a rather messy demo which shows that the above sequence logic is correct:

Demo

You can inspect the generated sequence and confirm that it covers:

0-999
2000-2999
4000-4999
etc.
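
You can also check the block boundaries directly in R; here is a small sketch (using a toy 10,000-row range purely for illustration):

y <- 0:9999
seq.keep <- y[floor(y / 1000) %% 2 == 0]
tapply(seq.keep, floor(seq.keep / 1000), range)   # each kept block runs from x000 to x999 for even x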

As mentioned in the code comment, make sure you generate a sequence large enough to accommodate the actual size of the data frame.
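
If you would rather not guess at the upper bound, one option is to count the rows by reading just a single column. This is only a sketch, and it assumes the data.table package (which provides fread) is loaded; it still scans the whole file, but keeps only one column in memory:

n_rows <- nrow(fread('RM.csv', select = 1L))   # number of data rows, header excluded
y <- seq(from = 0, to = n_rows - 1, by = 1)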

If you need to continue with your current approach, then try reading only every other block of 1000 lines, e.g.

sq <- seq(from=0, to=nl, by=2)
names <- colnames(read.csv('RM.csv', nrow=1, header=TRUE))
for(i in sq) {
    df_i <- fread('RM.csv', skip=1000*i, nrows=1000, col.names=names)
    # process this chunk and move on
}
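
If each chunk needs to be processed and written back out rather than accumulated in memory, a minimal sketch of that pattern (reusing the sq and names objects defined above) could look like the following; RM_processed.csv and the processing step are placeholders for your own match function and output file:

library(data.table)

out_file <- 'RM_processed.csv'   # hypothetical output file
for(i in sq) {
    chunk <- fread('RM.csv', skip=1000*i, nrows=1000, col.names=names)
    result <- chunk              # replace with your own match/processing step
    # write the first chunk with a header, then append the rest without one
    fwrite(result, out_file, append = (i != sq[1]), col.names = (i == sq[1]))
}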
Tim Biegeleisen
  • The size of the file is 20 GB, so if I try to read the whole data in one shot the system will crash. That is why I want to read a chunk of data, perform a match function on it later on, write the result out once that is done, and then start reading the next chunk, and so on. – user3301082 Aug 07 '17 at 09:36
  • @user3301082 I updated my answer. Just continue then with your current approach, but read the file using a sequence which targets the rows you want to read. – Tim Biegeleisen Aug 07 '17 at 09:48
  • The updated code is throwing an Error: object 'df' not found. But when I changed df[i] to df_i, the loop ended up in an infinite loop. – user3301082 Aug 07 '17 at 11:57
  • I don't think an infinite loop is possible, but rather perhaps the loop is just taking a really long time to run. It could take a while to process 20GB of data in R. Try changing the bounds of the loop to a smaller number just to verify this. – Tim Biegeleisen Aug 07 '17 at 12:00
  • Initially I am running it on a small 5 MB file with 30,000 rows. – user3301082 Aug 07 '17 at 12:03
  • How many times does it loop? Can you add a print statement to the loop? – Tim Biegeleisen Aug 07 '17 at 12:40
  • [1] "Loop 0" [1] "Loop 2" [1] "Loop 4" [1] "Loop 6" [1] "Loop 8" [1] "Loop 10" [1] "Loop 12" [1] "Loop 14" [1] "Loop 16" [1] "Loop 18" [1] "Loop 20" [1] "Loop 22" [1] "Loop 24" [1] "Loop 26" [1] "Loop 28" [1] "Loop 30" [1] "Loop 32" – user3301082 Aug 07 '17 at 12:57
  • It outputs a df_i data frame with 1000 obs. – user3301082 Aug 07 '17 at 12:57
  • Do you realize that `df_i` will be overwritten in each iteration of the loop? You can `rbind()` the pieces together if necessary. But based on your earlier comments, you wouldn't have space for 10GB anyway. – Tim Biegeleisen Aug 07 '17 at 13:02
  • Yes, I realize that now. It ends up the same as where it all started. – user3301082 Aug 07 '17 at 13:13