
I have a large CSV file (greater than 6GB). Preview of the file is given below:

ID,NUM,MMSRATE,SMSRATE,DATARATE
1,0100000109,623,233,331
2,0200000109,515,413,314
3,0600000109,611,266,662
4,0700000109,729,490,927
5,0800000109,843,637,736
6,0600000109,578,367,875

I want to find the difference of MMSRATE and SMSRATE and store it in a new column, PDRATE, in the same CSV file. A preview is given below:

ID,NUM,MMSRATE,SMSRATE,DATARATE,PDRATE
1,0100000109,623,233,331,390
2,0200000109,515,413,314,102

I have about 1 million rows. I want to read the rows in chunks of, say, 20000, perform the difference operation, write the result to an output CSV file, then read the next 20000 rows, process them and append them to the output CSV file, and so on.

I wrote code for reading the rows in chunks (of 2, for simplicity), but I am not able to compute the difference of the two columns within my script. The code is given below:

chunk_size <- 2
con <- file("input.csv", open = "r")

# First chunk: read with the header and remember the column names
data_frame <- read.csv(con, nrows = chunk_size, quote = "", header = TRUE)
header <- names(data_frame)
print(header)
print(data_frame)

if (nrow(data_frame) == chunk_size) {
  repeat {
    # Subsequent chunks: no header, so reapply the saved column names
    data_frame <- read.csv(con, nrows = chunk_size, header = FALSE, quote = "")
    names(data_frame) <- header
    print(header)
    print(data_frame)
    if (nrow(data_frame) < chunk_size) {
      break
    }
  }
}

close(con)

I'm fairly new to R. I'm running the RStudio IDE on Windows.

NOTE: The leading zero in the NUM column should be retained in the output CSV file. Please note that it is a requirement that I work on CHUNKS of rows from the CSV and not on the CSV file as a whole.

Raymond
  • Why can't you read the 1M rows at once? That's not so big. Then it's just a `df$PDRATE <- df$MMSRATE - df$SMSRATE` in one pass. – Tensibai Dec 01 '16 at 13:30
  • And if it's just a matter of CSV, awk is probably best suited: `awk -F "," -v OFS="," 'NR==1{print $0,"PDRATE"} NR>1{end=$3-$4; print $0,end}' your_file.csv` – Tensibai Dec 01 '16 at 13:34
  • @Tensibai That was the requirement given to me, i.e., perform the column difference operation on chunks of rows from the CSV, so I can't read the entire CSV file as a whole. Can you suggest possible solutions? – Raymond Dec 01 '16 at 13:53
  • So it's homework, you should have said so. No one will answer with these requirements without knowing they have to be fulfilled. – Tensibai Dec 01 '16 at 13:55
  • @Tensibai I'll make the required changes in my post. – Raymond Dec 01 '16 at 13:58
  • OK, from the Q I understood that the OP had trouble reading 1 million rows and hence was reading in chunks. That's why I proposed data.table. But if reading in chunks was the assignment, the same fread solution could be looped accordingly. Could the solutions listed here be his answer? http://stackoverflow.com/questions/19894194/reading-in-chunks-at-a-time-using-fread-in-package-data-table – user5249203 Dec 01 '16 at 14:06
  • @user5249203 By looking at the script I have provided, you can see that I am able to read the file in chunks from the CSV file. But I don't know how to write the difference of the 2 columns into a 3rd column in my script. Also, I don't know how to retain the leading zero from column NUM in the output CSV file. – Raymond Dec 01 '16 at 14:20

2 Answers


If I understand correctly: first create your output file from the initial chunk, then, on each subsequent chunk, compute the new column and append the result to the output file.

chunk_size <- 2
con <- file("input.csv", open = "r")

# NUM is read as character so its leading zero is preserved
col_classes <- c("integer", "character", "integer", "integer", "integer")

# First chunk: read with the header, compute PDRATE and create the output file
data_frame <- read.csv(con, nrows = chunk_size, quote = "",
                       colClasses = col_classes, header = TRUE)
header <- names(data_frame)

outfile <- "out.csv"
data_frame$PDRATE <- data_frame$MMSRATE - data_frame$SMSRATE
write.csv(data_frame, outfile, row.names = FALSE)

if (nrow(data_frame) == chunk_size) {
  repeat {
    # read.csv() errors once the connection is exhausted (which happens when the
    # row count is an exact multiple of chunk_size), so treat that as end of file
    data_frame <- tryCatch(
      read.csv(con, nrows = chunk_size, colClasses = col_classes,
               header = FALSE, quote = ""),
      error = function(e) NULL
    )
    if (is.null(data_frame)) break

    names(data_frame) <- header
    data_frame$PDRATE <- data_frame$MMSRATE - data_frame$SMSRATE

    # note the parameters: append = TRUE and col.names = FALSE
    write.table(data_frame, outfile, sep = ",", append = TRUE,
                qmethod = "double", col.names = FALSE, row.names = FALSE)

    if (nrow(data_frame) < chunk_size) {
      break
    }
  }
}

close(con)

Understanding the code above is left as an exercise :)

Side note after the comments: you can't append with write.csv; the documentation states:

Attempts to change append, col.names, sep, dec or qmethod are ignored, with a warning.

You have to use write.table and specify the separator and qmethod to match the write.csv defaults.
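
In other words, a minimal sketch of the two writing steps (first_chunk and next_chunk are placeholder data frames, not names from the question):

# First chunk: write.csv creates the file with the header row
write.csv(first_chunk, "out.csv", row.names = FALSE)

# Later chunks: write.table can append; sep and qmethod reproduce write.csv's format
write.table(next_chunk, "out.csv", sep = ",", qmethod = "double",
            append = TRUE, col.names = FALSE, row.names = FALSE)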

Tensibai
  • Warning messages: 1: In write.csv(data_frame, outfile, append = TRUE, col.names = FALSE, : attempt to set 'append' ignored. – Raymond Dec 01 '16 at 14:34
  • append operation is not taking place. Out.csv is holding only values present in the last iteration. – Raymond Dec 01 '16 at 14:38
  • Brilliant, thanks a lot. As a personal favor, is there any way to read only the NUM column as character? I know that with colClasses="character" I can read all the columns as strings, but I only want to read the NUM column as character. Is there a way to do that? Thanks in advance. – Raymond Dec 01 '16 at 15:08
  • In a meeting; I think you can give column classes to read.csv. Checking later. – Tensibai Dec 01 '16 at 15:10
  • @Ramkrishna `colClasses` can take a vector of types, one per column. Edited this way. – Tensibai Dec 01 '16 at 15:34
  • Let's assume that a CSV file has more than 100 columns. Then it would be very difficult to specify colClasses for every single column, right? What can be done in such a case? – Raymond Dec 01 '16 at 20:12
  • That assumption is too broad; there are many ways to achieve this depending on the starting case and the format of the second column. You should really take a step back and read an introduction to R, as we're back to basics here. – Tensibai Dec 01 '16 at 20:29
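
Following up on the colClasses discussion in the comments: for a file with many columns, one option is to build the colClasses vector programmatically, leaving NA (read.csv's default type detection) everywhere except the columns that need forcing. A minimal sketch, assuming the column names from the question's preview:

# Read just the header line to learn the column names
hdr <- names(read.csv("input.csv", nrows = 1, quote = ""))

# NA lets read.csv guess the type; only NUM is forced to character
classes <- rep(NA_character_, length(hdr))
classes[hdr == "NUM"] <- "character"

df <- read.csv("input.csv", colClasses = classes, nrows = 5, quote = "")
str(df)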

I came across an interesting package, chunked.

Solution 1

require(chunked)
require(dplyr)   # provides mutate() and the %>% pipe
data <- read_csv_chunkwise("Test.csv") %>% mutate(diff = MMSRATE - SMSRATE)
data

     ID       NUM MMSRATE SMSRATE DATARATE  diff
  (int)     (int)   (int)   (int)    (int) (int)
1     1 100000109     623     233      331   390
2     2 200000109     515     413      314   102
3     3 600000109     611     266      662   345
4     4 700000109     729     490      927   239
5     5 800000109     843     637      736   206
6     6 600000109     578     367      875   211

You can specify the chunk size; read the documentation for details.
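
For the question's chunk-by-chunk requirement, something along these lines should work. This is only a sketch: the chunk_size argument and write_csv_chunkwise are taken from the chunked documentation as I understand it, and the file names are the ones from the question.

require(chunked)
require(dplyr)

# Stream input.csv in chunks of 20000 rows, add PDRATE chunk by chunk,
# and write each processed chunk to out.csv without loading the whole file
read_csv_chunkwise("input.csv", chunk_size = 20000) %>%
  mutate(PDRATE = MMSRATE - SMSRATE) %>%
  write_csv_chunkwise(file = "out.csv")

Note that, as in the preview above, the NUM column is detected as an integer, so its leading zero would still need separate handling.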

Solution 2

If you intend to use data.table:

Use fread to read your CSV file:

require(data.table)
mydata <- fread("file.csv", sep = ",", header = TRUE)

fread creates a data.table

Then, all you have to do is

mydata[, Diffcol := MMSRATE - SMSRATE]  # add the new column by reference
mydata

 ID        NUM MMSRATE SMSRATE DATARATE Diffcol
1:  1 0100000109     623     233      331     390
2:  2 0200000109     515     413      314     102
3:  3 0600000109     611     266      662     345
4:  4 0700000109     729     490      927     239
5:  5 0800000109     843     637      736     206
6:  6 0600000109     578     367      875     211

Note: data.table is a very handy tool for working with large files. It also supports grouped operations and other calculations. You can learn more from the data.table cheat sheet.
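
To also satisfy the question's leading-zero requirement, here is a sketch of one possible approach; the named colClasses argument to fread and the fwrite function are used as I understand them, so verify them against your data.table version:

require(data.table)

# Force NUM to be read as character so the leading zero survives
mydata <- fread("file.csv", sep = ",", header = TRUE,
                colClasses = c(NUM = "character"))

# Add the difference column by reference
mydata[, PDRATE := MMSRATE - SMSRATE]

# fwrite quotes fields only when needed, so NUM is written back unquoted
fwrite(mydata, "out.csv")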

user5249203
  • Don't edit the question to break their MCVE. A CSV input is perfectly OK as an MCVE, and more useful than a table-like post, especially when the Q is about working with CSV files. – Tensibai Dec 01 '16 at 13:58
  • I realized it after editing. Thanks, will keep it in mind. – user5249203 Dec 01 '16 at 14:00