I have a large CSV file (greater than 6GB). Preview of the file is given below:
ID,NUM,MMSRATE,SMSRATE,DATARATE
1,0100000109,623,233,331
2,0200000109,515,413,314
3,0600000109,611,266,662
4,0700000109,729,490,927
5,0800000109,843,637,736
6,0600000109,578,367,875
I want to find the difference of MMSRATE AND SMSRATE and store it in a new column PDRATE in the same csv file. Preview is as given below:
ID,NUM,MMSRATE,SMSRATE,DATARATE,PDRATE
1,0100000109,623,233,333,390
2,0200000109,515,413,314,102
I have about 1 million rows. I want to read the rows in chunks of (say 20000), perform the difference operation, then write it into a output CSV file, then read the next 20000 rows, perform operations on that and write it into the output CSV file and so on.
I wrote a code for reading the rows in chunks (of 2 for the case of simplicity), but I am not able to perform the difference of the 2 columns within my script. The code is given below:
chunk_size <- 2
con <- file("input.csv", open = "r")
data_frame <- read.csv(con,nrows = chunk_size,quote="",header = TRUE,)
header <- names(data_frame)
print(header)
print(data_frame)
if(nrow(data_frame) == chunk_size) {
repeat {
data_frame <- read.csv(con,nrows = chunk_size, header = FALSE, quote="")
names(data_frame)<-c(header)
print(header)
print(data_frame)
if(nrow(data_frame) < chunk_size) {
break
}
}
}
close(con)
I'm fairly new to Rscript. I'm running R Studio IDE in Windows.
NOTE: The leading zero in NUM column should be retained in the output CSV FILE. Please note that it is a requirement that I work on CHUNKS of rows from the CSV and not on the CSV file as a whole.