How do I checkpoint only the best model from a ray tune run?

Question

NOTE: To some extent, this was already asked here but my question tackles a different aspect of getting the best checkpoint.

In the referenced question, the author only desired to retrieve the best checkpoint from a set of checkpoints after the ray tune run. I want to ensure that only the best checkpoint is saved in the first place. So basically, I am looking for something like:

At this position, the Ray checkpointing callback would be triggered. Check whether the current model state is better than the current "best checkpoint". If so, delete the old "best checkpoint" and replace it by checkpointing the current model state. If not, don't trigger the checkpointing callback.

The reason is that I am testing hundreds of large models simultaneously and need to save disk space.

Old question, but have you found an answer to this? Because I am running into the same issue right now. — shenflow, Mar 01 '23 at 09:14
I had other issues with ray tune and luckily managed to get "sufficient" results by improving the general modeling approach. So, I actually more or less got rid of ray tune entirely and unfortunately didn't solve the issue :/ — c0mr4t, Mar 02 '23 at 22:15
I am pretty sure that at that time there was no direct way of doing that. You could check Ray's docs to see whether they have introduced anything new, but I assume you already did that. I will respond to this question with an idea that MIGHT work. — c0mr4t, Mar 02 '23 at 22:18

c0mr4t · Answer 1 · 2023-03-02T23:05:44.280

I never solved the issue because I no longer needed to later on. But for anyone who runs into a similar issue, here is a suggestion that MIGHT work:

You basically have two options: either interfere with Ray Tune's main process, or control the models directly in its child processes. I think messing with Ray Tune's main process is more complicated, so I'd go with the subprocesses.

During training, Ray logs its progress and model results into files. You could find out exactly which files Ray writes these model results to. Then remove all checkpointing mechanisms that exist so far in your project and introduce a custom checkpoint callback in the training function of your model. This custom callback checks the model result files, and ONLY if the current model actually performed best is it checkpointed to a central folder in your project (possibly overwriting a previous best).
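To make the "only checkpoint if best" step concrete, here is a minimal sketch. It swaps in a simpler mechanism than parsing Ray's result files: each training process compares its score against a small `best.json` kept in the central folder. The function name `save_if_best`, the file names, and the lock-file approach are all made up for illustration and are not part of Ray; I have not tested this inside an actual Ray Tune run.

```python
import json
import os
import tempfile
from pathlib import Path


def save_if_best(score: float, payload: bytes, best_dir: str, mode: str = "max") -> bool:
    """Write `payload` into `best_dir` only if `score` beats the recorded best.

    Hypothetical helper: `payload` is the already-serialized model state.
    Returns True if the checkpoint was written, False otherwise.
    """
    root = Path(best_dir)
    root.mkdir(parents=True, exist_ok=True)
    meta_path = root / "best.json"
    lock_path = root / "best.lock"

    # Crude cross-process lock: O_EXCL makes creation fail if the file exists.
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another process is updating right now; skip this round

    try:
        best = None
        if meta_path.exists():
            best = json.loads(meta_path.read_text())["score"]
        if mode == "max":
            improved = best is None or score > best
        else:
            improved = best is None or score < best
        if not improved:
            return False

        # Write to a temp file first, then rename, so nobody sees a partial file.
        new_meta = json.dumps({"score": score}).encode()
        for name, data in (("model.bin", payload), ("best.json", new_meta)):
            with tempfile.NamedTemporaryFile(dir=root, delete=False) as tmp:
                tmp.write(data)
            os.replace(tmp.name, root / name)
        return True
    finally:
        os.close(fd)
        os.remove(lock_path)
```

In the training function you would call this after each evaluation step, for example with `pickle.dumps(model_state)` as the payload and a path on storage that all workers can reach. Two caveats: a process that crashes while holding the lock leaves `best.lock` behind, and a process that finds the lock taken simply skips that round, so a retry loop may be worth adding if you need the absolute best.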

Issues you might run into:

How can a subprocess identify itself? So basically if ray tune says "model 3 is currently best"... how does the subprocess know that it's model 3?

I am sure that there are multiple ways to deal with this issue (the most obvious way to differentiate between models would be the Ray Tune params that are set in the models).

How can you be sure that the model result files are always up to date?

If the files are not flushed properly, you might end up with only the second or third best model. I don't think that really matters with hundreds of models, but if you want the absolute best, it is something you should be aware of.
