
I've been trying to increase the speed of some code. I've removed all loops, am using vectors, and have streamlined just about everything. I've timed each iteration of my code, and it appears to be slowing down as the iterations increase.

### The beginning iterations
    user  system elapsed 
    0.03    0.00    0.03 
    user  system elapsed 
    0.03    0.00    0.04 
    user  system elapsed 
    0.03    0.00    0.03 
    user  system elapsed 
    0.04    0.00    0.05 

### The ending iterations
    user  system elapsed 
    3.06    0.08    3.14 
    user  system elapsed 
    3.10    0.05    3.15 
    user  system elapsed 
    3.08    0.06    3.15 
    user  system elapsed 
    3.30    0.06    3.37 

I have 598 iterations, and right now the full run takes about 10 minutes. I'd like to speed things up. Here's how my code looks; you'll need the RColorBrewer and fields packages. Here's my data. Yes, I know it's big; make sure you download the zip file.

    StreamFlux <- function(data, NoR, NTS){
       require(fields)        # provides image.plot()
       require(RColorBrewer)  # provides brewer.pal()

       ###Read in data to display points###
       WLX = c(8,19,29,20,13,20,21)
       WLY = c(25,28,25,21,17,14,12)
       WLY = 34 - WLY
       WLX = WLX / 44
       WLY = WLY / 33

       mf <- function(i){
          b = (NoR+8) * (i-1) + 8

          ###I read in data one section at a time to avoid headers
          mydata = read.table(data, skip=b, nrows=NoR, header=FALSE)
          rows = as.numeric(34 - mydata[,2])
          cols = as.numeric(45 - mydata[,3])
          flows = mydata[,7]
          rm(mydata)

          ###Create Flux matrix
          flow_mat <- matrix(0, 44, 33)

          ###Populate matrix###
          flow_mat[(rows-1)*44 + (45-cols)] <- flows + flow_mat[(rows-1)*44 + (45-cols)]
          flow_mat[flow_mat == 0] <- NA
          rm(flows, rows, cols)
          timestep = i

          ###Specifying jpeg info###
          jpeg(paste("Streamflow", timestep, ".jpg", sep=''),
               width=640, height=441, quality=75, bg="grey")
          image.plot(flow_mat, zlim=c(-1,1),
                     col=brewer.pal(11, "RdBu"), yaxt="n",
                     xaxt="n", main=paste("Stress Period ", timestep, sep=""))
          points(WLX, WLY)
          dev.off()
          rm(flow_mat)
       }

       ST <- function(x){
          functiontime = system.time(mf(x))
          print(functiontime)
       }
       lapply(1:NTS, ST)
    }

This is how to run the function:

    ###To run all timesteps###
    StreamFlux("stream_out.txt", 687, 598)
    ###To run the first 100 timesteps###
    StreamFlux("stream_out.txt", 687, 100)
    ###The first 200 timesteps###
    StreamFlux("stream_out.txt", 687, 200)

To test, remove `print(functiontime)` so it doesn't print at every timestep, then:

    > system.time(StreamFlux("stream_out.txt",687,100))
       user  system elapsed 
      28.22    1.06   32.67 
    > system.time(StreamFlux("stream_out.txt",687,200))
       user  system elapsed 
     102.61    2.98  106.20 

What I'm looking for is any way to speed up this code, and possibly an explanation of why it is slowing down. Should I just run it in parts? That seems like a clumsy solution. I've read about `dlply` from the plyr package; it seems to have worked here, but would it help in my case? How about parallel processing? I think I can figure that out, but is it worth the trouble here?
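
Here is a rough, untested sketch of what I think the parallel route would look like with the base `parallel` package; it assumes `mf` has been pulled out to the top level with `data` and `NoR` defined in the global environment, and the worker count of four is an arbitrary choice:

    library(parallel)

    ## Sketch only: farm each timestep's work out to a worker process
    cl <- makeCluster(4)                        # 4 workers; adjust to your cores
    clusterEvalQ(cl, {                          # load plotting packages on each worker
       library(fields)
       library(RColorBrewer)
    })
    clusterExport(cl, c("mf", "data", "NoR"))   # ship the function and its inputs
    parLapply(cl, 1:NTS, mf)                    # one timestep per task
    stopCluster(cl)

Whether that pays off presumably depends on where the time actually goes; if the bottleneck is reading from disk, extra workers won't help much.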

  • do some work yourself first and distill your problem to a **minimal** example where you're having problems – eddi Nov 21 '13 at 21:08
  • I'm not going to spend any time on an example this large and complicated, but I'll give you two pieces of advice: (1) use `Rprof` to determine _exactly_ which pieces are the bottleneck (a minimal `Rprof` sketch follows these comments), (2) I'll bet that your strategy of reading successive portions from disk is very sub-optimal. Touching the (presumably spinning) disk is slow. Read all the data in at once. Even then, you're trying to write several hundred plots to disk. That also won't be screaming fast, depending on your hard drive. – joran Nov 21 '13 at 21:17
  • @eddi I've been working to improve this code for some time now. If I knew what the source of my problems was, I would have posted a minimal example, as I have done for other sections of my code when I was streamlining things. I'm not sure where the problem is now, so I wanted to give everyone a look at the full code. What is so complicated about what I posted? All I'm doing is formatting data, plotting it, and repeating with `lapply`. The function `ST` isn't even needed; it's just there to print run times. – CCurtis Nov 21 '13 at 21:29
  • From a quick glance, I agree with @joran about (2). Your loop is probably slowing down due to this part of the code: `read.table(data, skip=b, nrows=NoR, header=FALSE)`. In particular, methinks the `skip=b` part is the culprit. You should read in all the data at the beginning, if possible, and then retrieve the necessary parts from memory for the calculations. – ialm Nov 21 '13 at 22:20
  • @ialm Thanks for the feedback; I will give that a try. There are some formatting issues with reading the file in all at once, but giving it some thought, I have a way around that. – CCurtis Nov 21 '13 at 22:45
  • @ialm Just ran the code again with the changes you and joran suggested. It was vastly faster: an elapsed run time of 104.14 vs. 10 to 12 minutes. Most of the run time is loading the data; I'm using `read.fwf()`. If I had a faster way of loading data, the run time would be faster yet. Thanks again for your help. – CCurtis Nov 22 '13 at 00:20
  • @ialm Could you post the information you gave as an answer? That way it is clear that the question was answered, and you'll get some reputation. – Paul Hiemstra Nov 22 '13 at 05:50
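
A minimal `Rprof` sketch, as referenced in the comments above; the output file name and the shortened 100-timestep run are arbitrary choices:

    ## Profile a shortened run to see where the time goes
    Rprof("streamflux.out")
    StreamFlux("stream_out.txt", 687, 100)
    Rprof(NULL)                       # stop profiling
    summaryRprof("streamflux.out")    # $by.self shows the dominant calls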

### 1 Answer


I will follow @PaulHiemstra's suggestion and post my comment as an answer. Who can resist Internet points? ;)

From a quick glance at your code, I agree with @joran's second point in his comment: your loop/function is probably slowing down due to repeatedly reading in your data. More specifically, this part of the code probably needs to be fixed:

    read.table(data, skip=b, nrows=NoR, header=FALSE)

In particular, I think the `skip=b` argument is the culprit. You should read in all the data at the beginning, if possible, and then retrieve the necessary parts from memory for the calculations.
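
Something along these lines (untested, since I don't have the file; `get_block` is just an illustrative helper name, and the block arithmetic mirrors the original `skip`/`nrows` logic):

    ## Read everything once. fill = TRUE tolerates the short header lines
    ## between blocks and skip = 8 drops the initial header (the settings
    ## that ended up working are described in the comment below); this
    ## assumes each intermediate header parses as exactly 8 rows.
    all_data <- read.table(data, fill = TRUE, skip = 8, header = FALSE)

    ## Illustrative helper: slice timestep i's block out of memory.
    get_block <- function(i) {
       start <- (NoR + 8) * (i - 1) + 1
       all_data[start:(start + NoR - 1), ]
    }

    ## Inside mf(), the per-iteration read then becomes:
    ## mydata <- get_block(i)

This trades one slow pass over the disk for cheap in-memory indexing on every iteration.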

  • @ialm Glad you posted an answer. I've done some further tweaking and have gotten things even faster. The starting elapsed time was 1014.49. Reading the data in all at once using `read.fwf`, the run time goes down to 104.14, most of which was loading time. I was then able to get `read.table` to read my data correctly by setting `fill=TRUE` and `skip=8`; the run time is now 19.78. Apparently `read.table` is much faster than `read.fwf`. How's that for improvement? Over 50 times faster. As for the run times of each iteration of the loop, the times are very small, but I observe no nonlinearity. Much thanks and cheers. – CCurtis Nov 22 '13 at 08:57