So I have about 300 large .csv files. I need to get column 1 and 2, and then sum columns from 3 to 20 for each file. What's the fastest way to do so? Thanks
-
1Not clear based on the description. You said you want to get column 1 and 2 and sum columns rest? Please show a small reproducible example and expected output\ – akrun Jul 10 '18 at 22:35
-
2Please read [how to ask good questions](https://stackoverflow.com/help/how-to-ask). Other good references include [reproducible examples](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and [minimal examples](https://stackoverflow.com/help/mcve). Furthermore, each of these has many many questions already answered on SO: (1) read many large CSV files into a list; (2) sum specific columns. – r2evans Jul 10 '18 at 22:36
-
1Unless it's a really extreme situation `sum` or `colSums` should work fine for the first part. To read in many files I would suggest to put them in the same folder or folders within the same folder and then use `dir` or `System` to get their names, etc, etc – Hack-R Jul 10 '18 at 22:40
-
Try this: `lst<-lapply(filenames,read.csv)` then `lst<-lapply(lst, function(x) cbind(x[1:2],s=rowSums(x[3:20])))`, does that do it? – r2evans Jul 10 '18 at 22:59
-
Yes sorry, I wasn't clear enough. Previously I was doing this: `d001 <- read.csv(file="d001.csv", header=TRUE, sep=",") transmute(d003,i, j , Flux= h01 + h02 + h03 + h04 + h05 + h06 + h07 + h08 + h09 + h10 + h11 + h12 + h13 + h14 + h15 + h16 + h11 + h22 + h23 + h24)` So I Can get a new dataset with columns i, j and Flux, and Flux is the sum of the rest of the columns. I need to do this for all 300 files, and then join them, so I was trying to find a loop but no success yet to do so; files are large (400 mb) – Ricardo Alonso Esparza Gamez Jul 11 '18 at 14:29
1 Answers
First put all files in a single folder.
filenames <- list.files(pattern = ".csv")
all <- lapply(filenames, function(name) {
readr:: read_csv(name)
})
Based on your description the list should have a length of 300, each item containing a dataframe. If you want to bind the rows of all dataframes you can use dplyr's bind_rows(). Below is a solution based on an example dataset with 4 cols per dataset.
# assumed dataset post importing into a list from step 1
listOfDataFrames <- vector(mode = "list", length = 4)
for (i in 1:4) {
listOfDataFrames[[i]] <- data.frame(Col1=sample(letters, 5, rep=F),
Col2=rnorm(5), Col3=rnorm(5))
}
all_df <- dplyr::bind_rows(listOfDataFrames)
required_df <- all_df[,1:2]
sum_of_cols <- apply(all_df[,3:4], 2, sum)
===== edited answer based on new information =====
If all your csv files d001 to d300 are in the same folder you can apply the same principles that I have shared above (I'm not sure about the efficiency of the code, but it should still work).
Step 1: load all data frames into a single list
Step 2: loop through the list performing the same transformation on all dataframes (each element of list)
Step 3: loop through list writing each element of list into a separate csv file
Here is an example code:
filenames <- list.files(pattern = ".csv")
all <- lapply(filenames, function(name) {
readr:: read_csv(name)
})
all_transformed <- lapply(all, function(df) {
transmute(df, i, j, sum = sum(h01:h24, na.rm = TRUE))
})
names(all_transformed) <- paste0("d",1:length(all_transformed))
lapply(1:length(all_transformed), function(i) write.csv(all_transformed[[i]],
file = paste0(names(all_transformed[i]), ".csv"),
row.names = FALSE))

- 341
- 2
- 10
-
Thanks Rachit, So what I need is this: `d003 <- read.csv(file="d003.csv", header=TRUE, sep=",") transmute(d003,i, j , Sum= h01 + h02 + h03 + h04 + h05 + h06 + h07 + h08 + h09 + h10 + h11 + h12 + h13 + h14 + h15 + h16 + h11 + h22 + h23 + h24)` Then, I need to do for all those files (names go from d001 to d300) and put it in one new csv file – Ricardo Alonso Esparza Gamez Jul 11 '18 at 14:00
-
-
Thank you! Extremely helpful, exactly what I was looking for! Still, I had to sum manually the h because it gave me this error: `In h01:h24 : numerical expression has 941017 elements: only the first used` Thanks a lot! – Ricardo Alonso Esparza Gamez Jul 12 '18 at 19:57