
This is not homework, just a learning exercise on my part. I have been running a very simple simulation (or number-crank) in R. It generates two vectors of random numbers (A and B) and is meant to run for 1 month.

A <- NULL
B <- NULL
x <- Sys.time()                 # record the start time
duration <- 2592000             # 30 days, in seconds
while (Sys.time() <= x + duration) {
  # the unnamed third argument lands in the 'replace' slot and is coerced to TRUE,
  # so each call draws 1000 values from 1:5 uniformly with replacement
  A <- append(A, sample(1:5, 1000, 1/5))
  B <- append(B, sample(1:5, 1000, 1/5))
  save.image()                  # re-saves the entire (growing) workspace every pass
}

I thought it was going well, but after one week (and several million numbers generated) the OS killed the process. Is there a better way of writing or running the simulation that would prevent the OS killing it?

I would prefer to rewrite the simulation rather than adapt the OS (for example, by adding more swap). I am running the simulation on a low-powered device (a Raspberry Pi) and am limited in what I can do on the hardware side. Thanks.

UPDATE:
1) It is not important that the samples be generated 1000 at a time. That was just my kludge.
2) It is important that the simulation runs for a set period of time, i.e. 1 week, 1 month or 1 year.
3) Unless it is impossible, I want the raw data.

Frank Zafka
    My guess would be that it is overflowing your memory. – Thomas Jungblut Jul 09 '12 at 09:02
  • I believe this is what I warned you against in your previous question on the same topic: http://stackoverflow.com/a/11158100/602276 – Andrie Jul 09 '12 at 09:08
  • Well I can run it for a week (x4) and not ask the question, or I can ask the question and get it to do what I want? I thought it was a reasonable question to ask. – Frank Zafka Jul 09 '12 at 09:11
  • This is hard for me to reproduce. :) Were there any error messages when the process was killed? Did you monitor memory usage? Wouldn't it also make sense to write to a single (txt) file? – Roman Luštrik Jul 09 '12 at 09:14
  • It was a Linux message, not an R message. I had no indication of failure (memory and load were not excessive, and the R data file was also not excessively large). I am not an expert at R or programming (I am a social scientist) and this is a learning experience. If you have a better way of doing this simple task, please offer a suggestion. – Frank Zafka Jul 09 '12 at 09:16
  • My guess would be that by appending larger and larger vectors, your system didn't like *something*. I would approach this by generating a large vector, then writing (append = TRUE) this to a single text file or a database. – Roman Luštrik Jul 09 '12 at 09:34
  • @RomanLuštrik : that clearly seems like the right solution to create the data but wouldn't he have the same problem again when trying to load the files into R after completion of the loop? I'm guessing that the length of the generated vectors is close to 1e9 ... – plannapus Jul 09 '12 at 09:51
  • My suggestion is that you summarize your data in each loop, then save the summary to disk. This should prevent the problem of running out of addressable space (either memory or disk). – Andrie Jul 09 '12 at 09:57
  • Unless impossible, I want the raw data. – Frank Zafka Jul 09 '12 at 09:58
  • @plannapus that's another department's problem then. :) One way to analyze big data is to sample it down into manageable pieces. If you have a representative sample, a few thousand points will be just as good as a billion. If one stores this in a database, there are plenty of tools in R to handle these sorts of computations. – Roman Luštrik Jul 09 '12 at 10:01
  • did you check the size of your saved image? Did you simply fill your entire disk space? It may be worth calculating at the beginning how much disk space this will take and make sure you have appropriate resources available...I have to imagine that will get really large really fast...too large to be usable for anything really - but that's something for you to figure out :) – Chase Jul 09 '12 at 12:52
  • I did check. As stated above, memory, system load and the R save size were all reasonable, and I was not predicting to run out of disk space at any point. – Frank Zafka Jul 09 '12 at 12:54
  • I just did a tiny bit of very unscientific timing, and it looks like it takes around 1 microsecond to generate a random number. If we assume that the only thing we're storing is that number, and we continue doing this for a month, then we're looking at around 10^12 numbers. Storing these as 4-byte integers means 10 TB of data. Granted, yours might be a bit slower as you're saving to disk on every loop, but I'd be ready for that sort of memory and disk requirement. – Haldean Brown Jul 14 '12 at 14:33
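A rough check of that estimate, assuming (as in the comment above) about 1 microsecond per draw and 4 bytes per stored integer; these are ballpark figures, not measurements from the Raspberry Pi:

# back-of-the-envelope sizing for one month of continuous generation
seconds_per_month <- 30 * 24 * 60 * 60          # 2,592,000 seconds
draws_per_month   <- seconds_per_month / 1e-6   # ~2.6e12 draws at 1 microsecond each
bytes_total       <- draws_per_month * 4        # 4 bytes per integer
bytes_total / 1e12                              # roughly 10 TB, per stream

The real rate on the Pi will be lower, but the order of magnitude is the point.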

2 Answers


If the goal is to create two big samples, consider the following:

N <- 2000000
A <- sample(1:5, N, 1/5)
B <- sample(1:5, N, 1/5)
save.image()

If it is important that the samples be formed 1000 at a time, alternating between A and B, consider this:

N <- 2000                    # number of chunks
n <- 1000                    # samples per chunk
A.list <- vector("list", N)  # preallocate the lists
B.list <- vector("list", N)
for (i in 1:N) {
   A.list[[i]] <- sample(1:5, n, 1/5)
   B.list[[i]] <- sample(1:5, n, 1/5)
}
A <- unlist(A.list)          # flatten once, at the end
B <- unlist(B.list)
save.image()                 # save once, after the loop

This should take care of the two main issues in your code:

  • every time you use append inside your loop, R has to create and fill a couple of new objects from scratch. As the objects become larger, your loop iterations become slower and slower; computation times grow quadratically, I believe (the timing sketch below illustrates this). You also run the risk of fragmenting your memory space; this is harder to explain, but you can try to research it. By using a list, only the new data from each iteration needs to be stored to memory and the computation time per loop remains the same.
  • I have moved save.image() outside of the loop. Same idea: saving objects as they get bigger and bigger will take longer and longer, i.e. it will slow down your iterations. Since you only care about the final vectors, it makes sense to save only when you are done.

You can play with the value of N to see how far your OS will let you go. The advantage is that you don't have to wait for a week or a month to find out what the limits are.
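A minimal timing sketch of this effect, comparing append-based growth with the list approach above; the chunk counts are arbitrary, and replace = TRUE just spells out what the positional 1/5 ends up doing:

n.chunks <- 2000
chunk    <- 1000

# growing a vector with append(): each iteration copies everything accumulated so far
t.append <- system.time({
    A <- NULL
    for (i in 1:n.chunks) A <- append(A, sample(1:5, chunk, replace = TRUE))
})

# filling a preallocated list, then flattening once at the end
t.list <- system.time({
    A.list <- vector("list", n.chunks)
    for (i in 1:n.chunks) A.list[[i]] <- sample(1:5, chunk, replace = TRUE)
    A <- unlist(A.list)
})

t.append["elapsed"]
t.list["elapsed"]

The absolute numbers will differ on a Raspberry Pi, but the gap between the two elapsed times widens quickly as n.chunks grows.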

flodel
  • For philosophical reasons I wish to run the simulation for 1 month. That is an important part of the simulation. It may appear a strange request, but there it is. Thanks for responding with an idea though. – Frank Zafka Jul 09 '12 at 11:06
  • you could put a `Sys.sleep(2592000 / N)` in the middle of my loop. Here you go. – flodel Jul 09 '12 at 11:34
  • Erm. Isn't that cheating? I want to generate A and B for a month, not take a month to generate a fixed large number. Subtle difference? – Frank Zafka Jul 09 '12 at 11:35
  • OK, it may be early Monday AM, but what does "philosophical" have to do with reality? What are you really trying to do? Test whether `sample` operates differently under different moon phases? What you have shown us is not a "simulation" but rather a number crank. There are far simpler ways to find the various limits an OS puts on RAM, file size, etc. Perhaps if you can explain what you are actually trying to find out, we can provide the proper methodology. – Carl Witthoft Jul 09 '12 at 12:25
  • It is a "simulation" to provide me with raw data (A and B). Why I want it is because I am interested in achieving this. It worked for a week. I am looking for help in getting it working beyond that time-frame. As stated in the question. – Frank Zafka Jul 09 '12 at 12:46
  • If you want an idea about why I am doing this see: http://www.springerlink.com/content/e8322vu4035118p8/ – Frank Zafka Jul 09 '12 at 13:05
  • To be clear. Flodel's answer does not address my requirements as specified and thus does not improve on the method outlined in the question. – Frank Zafka Jul 09 '12 at 13:38
  • You are trying to create an ever growing data set in memory: you are doomed to fail as you reach R's allocated memory. So if you want your program to run fine for a month, i.e. not reach that limit, you need to write _less_ efficient code. I know, that sounds crazy, but you've done a pretty good job so far, so keep up the good work. You could for example write your own `append` and `sample` functions and make them as inefficient as you possibly can. You could even ask for help from people who don't know anything about programming. Hope that helps. – flodel Jul 09 '12 at 14:36
  • @flodel I detect a hint of sarcasm. I am not looking to sample a fixed N, but generate as many AB samples as possible in the one month. They can be printed out on paper as far as I am concerned. Though this would be a last resort. Is it really so hard to generate two random numbers between 1-5 (AB) and write that to disk (and for that process to loop for one month)? – Frank Zafka Jul 09 '12 at 14:51

If you consider printing out the result on paper an acceptable solution, then Roman Luštrik's suggestion (in the comments to your question) of appending your data to a text file or a database is definitely a good one.

Here is what appending to a text file would look like:

x <- Sys.time()
duration <- 2592000  # 30 days in seconds
while (Sys.time() <= x + duration) {
    # append each fresh chunk to disk instead of growing an object in memory
    write.table(sample(1:5, 1000, 1/5), file = "A.txt", append = TRUE,
                row.names = FALSE, col.names = FALSE, sep = "\t")
    write.table(sample(1:5, 1000, 1/5), file = "B.txt", append = TRUE,
                row.names = FALSE, col.names = FALSE, sep = "\t")
}
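If loading the finished files back into R in one piece proves too much for the Pi (the concern plannapus raised in the comments above), the raw data can be processed from an open connection in chunks. A minimal sketch, assuming the A.txt written by the loop above and an arbitrary chunk size of one million values; here it computes the overall mean without ever holding the full vector in memory:

con <- file("A.txt", open = "r")
total  <- 0   # running sum (kept as a double to avoid integer overflow)
n.read <- 0   # running count of values read
repeat {
    chunk <- scan(con, what = integer(), n = 1e6, quiet = TRUE)
    if (length(chunk) == 0) break   # end of file
    total  <- total + sum(chunk)
    n.read <- n.read + length(chunk)
}
close(con)
total / n.read   # mean of A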
plannapus
  • Thanks. Do you think that this will run for the month, as opposed to the version posted in the question that did not? I am happy to test it out. – Frank Zafka Jul 09 '12 at 15:02
  • Well I'm not gonna try it but feel free to do so. The text-files are gonna be pretty massive so be sure you have enough disk space. Otherwise since you're not actually keeping an ever-growing object, I would think memory (or floating point) won't be a problem.... – plannapus Jul 09 '12 at 15:07
  • It's too efficient, he's going to run out of disk (not memory this time) space. – flodel Jul 09 '12 at 15:16
  • You'll probably need them: after 1 min both files were already 27 MB. – plannapus Jul 09 '12 at 15:23
  • I've got about 28 MB in 11 mins here. Funny, my original file (from the question) was only 12 MB after the week. – Frank Zafka Jul 09 '12 at 15:29
  • Clearly he needs to zip his data at the highest compression ratio available to avoid overwriting the kernel on his hard drive. Or better yet, just store the histogram of his five possible output values (see the sketch below). – Carl Witthoft Jul 09 '12 at 16:58
  • This was a whimsical project and I'm obviously not going to be able to deal with over a terabyte of data without resorting to summaries. And that's much easier. – Frank Zafka Jul 09 '12 at 19:49
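For completeness, a minimal sketch of the summary approach Andrie and Carl Witthoft suggested in the comments: keep only running counts of the five possible values inside the timed loop, so neither memory nor disk use grows over the month. The chunk size of 1000 and the use of saveRDS for the small summary file are illustrative choices, not something taken from the thread:

x <- Sys.time()
duration <- 2592000                  # 30 days in seconds

A.counts <- numeric(5)               # doubles, to avoid integer overflow over a long run
B.counts <- numeric(5)

while (Sys.time() <= x + duration) {
    A.counts <- A.counts + tabulate(sample(1:5, 1000, replace = TRUE), nbins = 5)
    B.counts <- B.counts + tabulate(sample(1:5, 1000, replace = TRUE), nbins = 5)
    # overwrite one small, fixed-size summary file each pass
    saveRDS(list(A = A.counts, B = B.counts), "counts.rds")
}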