1

I have a folder with many csv files in it. They are all structured as per picture

enter image description here

I am interested to count the numbers under my variable "Type" and get an output that tells me that there are two number 7, two number 9, two 1 and so on. I want to do this for the csv files in my folder, and it would be great to bind the outputs from different files together (with an identifier to the original file the output was extracted from). So far, I managed to do it for individual files with this code:

mydata <- read.csv("1_data.csv", skip=1, header = T)
df <- data.frame(table(mydata$Type))

However, I tried to code a loop and got stucked. This is the code I am using:

files = list.files(pattern = "*.csv")

for (i in files) {
  id <- substr(i, 1, 5)
  mydata <- read.csv (i, skip=1, header = T)
  datatobind <- data.frame(table(mydata$Type))
  datatobind["id"] <- as.numeric(id)
  data <- rbind(data, datatobind)
}
do.call (rbind, data)

write.csv(data, file='final.csv', row.names=FALSE)

I get a different error every time I try to change the code, so I am not sure how to fix this.

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • try reading the files to a list first, using something like `L <- lapply( files, data.table::fread, skip = 1 )`, and then rowbind the list using `DT <- data.table::rbindlist(L, use.names = TRUE, fill = TRUE )` – Wimpel Sep 09 '20 at 09:27
  • https://stackoverflow.com/questions/11433432/how-to-import-multiple-csv-files-at-once – jogo Sep 09 '20 at 09:29
  • Hi both, thanks for your reply. Unfortunately, I am interested in merging the outputs together (so after I counted the numbers under my "Type" variable). This is because I am trying to identify errors and I need to see in which file the errors are. The binded output CSV file I have in mind should contain a header with the file name or a blank row to make it clear. - but this would be after I figure out how to do the loop – PsychSilvia Sep 09 '20 at 10:09

2 Answers2

0

Here are couple of ways to do count Type column from each file, add a new column with the filename and bind the output together.

Using base R :

files = list.files(pattern = "*.csv", full.names = TRUE)

new_data <- do.call(rbind, lapply(files, function(x) {
                  mydata <- read.csv(x, skip=1, header = TRUE)
                  transform(as.data.frame(table(mydata$Type)), 
                            filename = basename(x))
            }))

and with tidyverse :

library(dplyr)

new_data <- purrr::map_df(files, function(x) {
  mydata <- read.csv(x, skip=1, header = TRUE)
  mydata %>%
    count(Type) %>%
    mutate(filename = basename(x))
})
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
0

Here is a parallel version that suits your needs. You may need to install doSNOW and parallel packages:

library(doSNOW)
library(parallel)

setwd("path/to/folder")

all_files = list.files(pattern = "\\.csv$")
num_files = length(all_files)

cl <- makeCluster(min(num_files, floor(detectCores()*0.9)), outfile = "")
registerDoSNOW(cl)
dataset <- foreach(i=1:num_files, .combine='rbind') %dopar% 
{
  read.csv(all_files[i], header=TRUE)
}
stopCluster(cl)
registerDoSEQ()

write.csv(dataset, file='final.csv', row.names=FALSE)

Tested on Windows 10 x64, with massive speedup vs regular loop.

tomasvc
  • 41
  • 1
  • 2