2

I am dealing with a dataset that consists of over 90,000 csv.-files. Each csv.-file shows sampling data of a specific chemical measured at a specific sampling site. The files look like this:

#csv1
chemical_ID samplingsite    A   result  year    month   
1   1   1   0.5 2008    7
1   1   1   0.5 2008    5
1   1   1   0.5 2008    1
1   1   1   0.3 2008    11
1   1   1   0.5 2010    6
1   1   1   0.4 2010    10
1   1   1   0.5 2010    2
1   1   1   0.5 2010    4
1   1   1   0.4 2013    3
1   1   0   0.2 2013    5
1   1   0   0.1 2013    7
1   1   1   0.5 2013    9
1   1   1   0.4 2014    3
1   1   0   0.2 2014    5
1   1   0   0.1 2014    7
1   1   1   0.5 2014    9

#csv2
chemical_ID samplingsite    A   result  year    month
2   1   1   0.8 2008    6
2   1   1   0.7 2008    9
2   1   1   0.9 2008    11
2   1   1   0.6 2008    12
2   1   1   0.5 2010    2
2   1   1   0.4 2010    5
2   1   1   0.8 2010    6
2   1   1   0.9 2010    8

#csv3
chemical_ID samplingsite    A   result  year    month
100 2   1   1.5 2001    1
100 2   1   1.2 2001    6
100 2   1   1.7 2002    1
100 2   1   0.9 2002    6
100 2   1   1.8 2003    1
100 2   0   1.4 2003    6
100 2   1   1.5 2004    1
100 2   0   1.2 2004    6

To reduce the amount of files I would like to select only the files that match specific criteria and save them in a new folder. Criteria for each chemical shall be:

Number of sampled years > 4
Number of samplings per year >= 4
Number of factor “1” in column “A” per year >= 4

I’ve tried but can’t find a solution for my task and google wasn’t helpful at all. This is what I’ve got so far:

{
mycsv=list.files(path="D:/…/in ", pattern="allyears")
n <- length(mycsv)
mylist <- vector("list", n)

for(i in 1:n)
mylist[[i]] <- read.csv(mycsv[i], header = TRUE)
mylist <- lapply(mylist, FUN=function(x) length(unique(x$year)))
#???

for(i in 1:n)
write.csv(file = paste("D:/…/out", mycsv[i], sep = ""), 
mylist[i], row.names = F)
}

Thanks in advance
Nis

Nis
  • 23
  • 3
  • create a function to read a csv, check the columns match your criteria and return e.g. `TRUE` if they do, then apply said function to `mylist`, then use the result to cull `mylist` – Scransom Jun 12 '18 at 09:27

2 Answers2

1

This approach gets the list of your files, creates a function to check your criteria, checks the if the files meet these criteria with the function, makes a list of files meeting your function, and then writes the csvs to a new folder (that you must have created).

The example is designed to work with the csvs in your question, where none meet your criteria as I interpret them, so a test criteria is added where csv1 meets your criteria. To switch these off, just remove the # from your criteria, and put a # in front of the test criteria.

file.list <- list.files() # gets list of files - assumes your working directory is where the files are

check.csv <- function(csv.path){ #checks your criteria

  the.csv <- read.csv(file = csv.path, header = TRUE)

  sampled.years <- length(unique(the.csv$year))

  min.samples.per.year <- min(table(the.csv$year))

  min.f1A <- min(table(the.csv$year, the.csv$A)[,"1"])

  #your criteria
  #meets.criteria <- ifelse(sampled.years > 4 & min.samples.per.year >= 4 & min.f1A >=4, TRUE, FALSE) 

  #test criteria
  meets.criteria <- ifelse(sampled.years >= 4 & min.samples.per.year >= 4 & min.f1A >= 2, TRUE, FALSE)

  return(meets.criteria)  
} 

check.files <- sapply(file.list, check.csv) # checks if files meet criteria, as above, assumes that file.list has the whole path, which it will if your working directory is where the files are

files.to.write <- file.list[check.files] # subsets list of files to move

read.write <- function(csv.path){ # function to write csvs into new folder specified in the path as other_folder

  the.csv <- read.csv(file = csv.path, header = TRUE)

  write.csv(the.csv, file = sprintf("other_folder/%s", csv.path))
  # this other_folder must exist
} 


sapply(files.to.write, read.write) # write csvs to new folder
Scransom
  • 3,175
  • 3
  • 31
  • 51
  • Hey there, thanks a lot. This code should be exactly what I am looking for. Unfortunately I am getting two error messages at the end of the code: Error in file(file, ifelse(append, "a", "w")) : cannot open the connection AND In addition: Warning message: In file(file, ifelse(append, "a", "w")) : cannot open file 'D:/_..._final/': Permission denied -- what am I missing? – Nis Jun 12 '18 at 10:54
  • Where are these errors occurring? After the `sapply(files.to.write, read.write)` call? It might be something to do with admin permissions on the computer? Can you `write.csv` an arbitrary object like the `cars` data? – Scransom Jun 12 '18 at 12:55
  • Yes, they are occurring after the sapply(files.to.write, read.write) call. Normally I don't have any problems with write.csv and I am the admin of this laptop. – Nis Jun 12 '18 at 13:14
  • Can you run `read.write(files.to.write[1])`? And if not, try each of the commands inside the function outside of it. Not clear if the issue is with this code or something else specific to your machine. You've said you "normally" have no problems with write.csv, but can you write one of these files? – Scransom Jun 13 '18 at 02:59
  • Some possible solutions in the comments and answers to this question: https://stackoverflow.com/questions/17156445/why-i-get-this-error-writing-data-to-a-file – Scransom Jun 13 '18 at 03:08
  • 1
    I could solve the problem by changing `write.csv(the.csv, file = sprintf("D:/.../out", csv.path))` into `write.csv(the.csv, file = paste("D:/.../out/new", csv.path, sep = ""))`. Now the code works perfectly. – Nis Jun 13 '18 at 12:02
0

First create a duplicate of the folder. Then we will delete all the files that do not meet the conditions given above:

Write a function:

testfile = function(path){
dat= read.csv(path,header=T)
dat1=aggregate(.~year,dat,length)
a=nrow(d)>4
b=d$samplingsite>=4
d=d$A>-4
if(!(a&b&d)) file.remove(path)
}

mycsv=list.files(path="D:/…/in ", pattern="allyears")# Path to the duplicate folder

sapply(mycsv,testfile)
Onyambu
  • 67,392
  • 3
  • 24
  • 53