Use a loop or a function to run a script over multiple files using R

Question

Possible Duplicate:
Which is the best method to apply a script repetitively to n .csv files in R?

I have a script that calculates kernel utilization distributions and generates a PDF of the summary plots. My script currently split the database by each "period" and uses a for loop to generate the KUD plots. I also have multiple csv files store in a directory that correspond to individual fish. What I would like is to use a function or loop to read all the files in the directory and run the script that will generate PDF plots for each individual fish and each period.

My script looks like this:

library(ks)

dd<-read.table(text="period  dist   depth
            1   4916.64 8.661827
            1   4916.64 14.789091
            1   4916.64 13.555909
            1   4916.64 12.92816
            1   4916.64 11.708774
            1   4916.64 15.28
            1   4916.64 13.369875
            1   4916.64 14.039655
            1   4916.64 13.454545
            1   4916.64 12.638261
            1   4916.64 13.251081
            1   4916.64 14.006341
            1   4916.64 12.64
            1   4916.64 15.521818
            1   4916.64 10.202121
            1   4916.64 14.816667
            1   4916.64 15.504
            1   9674.844    23.93
            1   11000.151   22.157143
            1   11414.31    22.72
            1   11414.31    25.7
            1   11414.31    19.07
            1   11414.31    23.085714
            1   9481.57 17.266667
            1   11414.31    26.8
            1   11414.31    19.382222
            1   5616.09 12.016667
            1   10658.02    18.873913
            1   11414.31    25.2
            1   11414.31    20.9
            1   11414.31    27.65
            1   11414.31    22.133333
            1   11414.31    30.9
            1   5616.09 23.3
            2   11172.718   20.391667
            2   9964.755    23.51
            2   5616.09 19.43
            2   5616.09 19.1
            2   4916.64 18.42
            2   8515.2  17.683333
            2   11414.31    22.128571
            2   11414.31    22.8608
            2   10391.095   24.955882
            2   10931.125   25.225
            2   6444.407    20.228571
            2   11276.257   23.77619
            2   10585.993   23.285714
            2   10641.214   20.653333
            2   9757.676    24.007143
            2   11414.31    18.817
            2   11414.31    23.525
            2   11414.31    22.873684
            2   11414.31    26.15
            2   10486.595   21.9
            2   11000.151   24.142857
            2   11414.31    24.3875
            2   10819.621   20.569231
            2   10360.088   29.345455
            2   9708.951    21.488235
            2   11414.31    30.775
            2   11414.31    25.5
            2   11414.31    18.477917
            2   10327.144   26.8625
            2   11414.31    26.12963
            2   11414.31    29.28125
            2   11414.31    23.166667
            2   10689.532   21.8625
            2   11414.31    28.328571
            2   11414.31    22.563158
            2   11414.31    25.490909
            2   11414.31    26.0625
            2   11414.31    34.5
            2   11414.31    17.375294
            ",header=T)

dd1<-data.frame(dd$period,dd$dist,dd$depth)

# split database for each period
period<-dd1$dd.period
M<-split(dd1,period)
l<-length(M)

# run loop through each tag and create a PDF file with all KUD plots
pdf("KUD plot.pdf",width=11,height=8,paper="a4r")
par(mfcol=c(1,1))

for(j in 1:l){

# calculate the 2D kernel
dd2<-data.frame(M[[j]]$dd.dist,M[[j]]$dd.depth)

## auto bandwidth selection
H.pi2<-Hpi(dd2,binned=TRUE)*1
ddhat<-kde(dd2,H=H.pi2)

# Kernel contour plot
plot(ddhat,cont=c(95),drawpoints=TRUE,col="black",xlab="Distance (m)",lwd=2.5, 
    ylab="Depth (m)",ptcol="grey15",cex=0.7,
    xlim=c(min(dd2[,1]-dd2[,1]*0.4),max(dd2[,1]+dd2[,1]*0.4)),ylim=c(45,-1),
    main=paste("Period"," - ",M[[j]]$dd.period[1])) 

plot(ddhat,cont=c(25),add=TRUE,col="red",lwd=2.4)
plot(ddhat,cont=c(50),add=TRUE,col="seagreen2",lwd=2.4)
plot(ddhat,cont=c(75),add=TRUE,col="royalblue",lty=5,lwd=2.5)  

}

dev.off()

Any suggestions or ideas of how to improve the script will be appreciated. Thanks in advance!

Definitely duplicate. also don't see any attempt to solve by OP. — CHP, Jan 29 '13 at 08:59

score 0 · Answer 1 · answered Jan 29 '13 at 09:06

The answer @RomanLustrik linked does not contain an apply based solution, so I'll present one here. The function you'll likely want is d*ply from the plyr package. plyr does all the splitting-applying a function to each split-and combining it back together again. For example, to get the mean depth per period:

library(plyr)
ddply(dd, .(period), summarise, mn = mean(depth))

You can use the same idea to perform your analysis:

res = dlply(dd, .(period), function(x) {
   dd2 = x[-1]
   H.pi2<-Hpi(dd2,binned=TRUE)*1
   ddhat<-kde(dd2,H=H.pi2)
 })

where res is a list of length 2 with the results of the analysis. You can use a similar approach to extract the information from res, and plot it. I would use ggplot2 for that.

score 0 · Answer 2 · answered Jan 29 '13 at 09:06

One way to simplify your code is to replace the for loop with a call to ddply from the plyr package. For example :

library(plyr)

pdf("test.pdf",width=11,height=8,paper="a4r")
invisible(ddply(dd1, .(period), function(df) {
  period <- df$dd.period[1]
  df <- df[,-1]
  ## auto bandwidth selection
  H.pi2<-Hpi(df,binned=TRUE)*1
  ddhat<-kde(df,H=H.pi2)

  ## Kernel contour plot
  plot(ddhat,cont=c(95),drawpoints=TRUE,col="black",xlab="Distance (m)",lwd=2.5, 
       ylab="Depth (m)",ptcol="grey15",cex=0.7,
       xlim=c(min(df$dd.dist-df$dd.dist*0.4),max(df$dd.dist+df$dd.dist*0.4)),ylim=c(45,-1),
       main=paste("Period"," - ",period)) 

  plot(ddhat,cont=c(25),add=TRUE,col="red",lwd=2.4)
  plot(ddhat,cont=c(50),add=TRUE,col="seagreen2",lwd=2.4)
  plot(ddhat,cont=c(75),add=TRUE,col="royalblue",lty=5,lwd=2.5)  

}))
dev.off()

Then you could wrap this code into your own function wihich would take a data frame and a pdf filename as argument :

pdfks <- function(dd1, filename) {
  pdf(filename, width=11, height=8, paper="a4r")
  ...
  dev.off()
}

Then you can create a loop that will load each of your csv file in turn and apply this function to each of them.

Use a loop or a function to run a script over multiple files using R

2 Answers2