
In my setup, I run a script that trains a model and starts generating checkpoints. Another script watches for new checkpoints and evaluates them. The scripts run in parallel, so evaluation is just a step behind training.

What's the right Trains configuration to support this scenario?

DalyaG
Michael Litvin

2 Answers

1

disclaimer: I'm part of the allegro.ai Trains team

Do you have two experiments, one for training and one for testing?

If you do have two experiments, then I would make sure the models are logged in both of them (which is automatic if they are stored on the same shared folder/S3/etc.). Then you can quickly see the performance of each one.
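For example, a minimal sketch of the two-experiment setup, assuming the usual Task.init auto-logging of framework save/load calls (the project/task names and the S3 bucket are placeholders):

from trains import Task

# training script: checkpoints saved by the framework are uploaded and
# registered as output models of this experiment
Task.init(project_name='my_project', task_name='train',
          output_uri='s3://my-bucket/checkpoints')

# evaluation script (its own experiment): loading a checkpoint is picked up
# as an input model, so the same file is visible from both experiments
# Task.init(project_name='my_project', task_name='evaluate')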

Another option is sharing the same experiment: the second process adds its reports to the original experiment, which means you somehow have to pass the experiment ID to it. Then you can do:

task = Task.get_task(task_id='training_task_id')
task.get_logger().report_scalar('title', 'loss', value=0.4, iteration=1)

EDIT: Are the two processes always launched together, or is the checkpoint test general-purpose code?

EDIT2:

Let's assume you have a main script training a model. This experiment has a unique task ID:

my_uid = Task.current_task().id

Let's also assume you have a way to pass it to your second process (if this is an actual sub-process, it inherits the OS environment variables, so you could do `os.environ['MY_TASK_ID'] = my_uid`).
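A minimal sketch of that hand-off, assuming the training script itself launches the evaluation script as a sub-process (the file name evaluate.py is a placeholder):

import os
import subprocess

from trains import Task

# training script: expose the current task ID to child processes
os.environ['MY_TASK_ID'] = Task.current_task().id

# the sub-process inherits the environment, so evaluate.py can read MY_TASK_ID
subprocess.Popen(['python', 'evaluate.py'])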

Then in the evaluation script you could report directly into the main training Task like so:

train_task = Task.get_task(task_id=os.environ['MY_TASK_ID'])
train_task.get_logger().report_scalar('title', 'loss', value=0.4, iteration=1)
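Putting it together, a minimal sketch of such an evaluation loop, assuming checkpoints appear as files in a local checkpoints/ folder and that evaluate_checkpoint is a placeholder for your own evaluation code:

import os
import time

from trains import Task


def evaluate_checkpoint(path):
    # placeholder: load the checkpoint and compute your validation metric here
    return 0, 0.0


train_task = Task.get_task(task_id=os.environ['MY_TASK_ID'])
logger = train_task.get_logger()

seen = set()
while True:
    for name in sorted(os.listdir('checkpoints')):
        if name in seen:
            continue
        seen.add(name)
        iteration, loss = evaluate_checkpoint(os.path.join('checkpoints', name))
        # the scalar lands in the training task, next to the training curves
        logger.report_scalar('eval', 'loss', value=loss, iteration=iteration)
    time.sleep(30)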
Martin.B
  • The best would be that in Trains they would appear as one experiment. These are separate general-purpose scripts, running in parallel. So the training script is still in Running state, and according to the docs sharing the experiment id wouldn't work... – Michael Litvin Jun 12 '20 at 05:34
  • @MichaelLitvin "and according to the docs sharing the experiment id wouldn't work..." Not sure what you mean by that, but it is supported. I edited the original answer with a full explanation. The main caveat is passing the task UID to the evaluation process, but that is a technical detail that can be solved easily once we understand the setup :) – Martin.B Jun 22 '20 at 13:45
  • The docs say that in order for the task to be reused, "the Task’s status is Draft, Completed, Failed, or Aborted". In my scenario, training and evaluation are two separate scripts running in parallel. Training generates model checkpoints, evaluation reads them and produces metrics. I want both these scripts to write to the same Task, but when I run evaluation the training task would be in Running state. – Michael Litvin Jun 22 '20 at 17:39
  • I see... seems like we need to rephrase the documentation a bit. "In order for the task to be reused, ...": the term "reused" is a bit ambiguous; what it should say is that when calling `Task.init`, the previous Task is reused only if no artifacts/models were created in that run and the Task was not archived/published; otherwise a new Task is created. Bottom line, it has nothing to do with your use case. My edited reply should solve your problem: the training script creates the Task (and will later close it when it exits) and the evaluation script will report (in parallel) to the same Task. Make sense? – Martin.B Jun 22 '20 at 21:10
1

@MichaelLitvin, we had the same issue, and also had the same names for everything we logged in train and test, since it comes from the same code (obviously). In order to avoid a train/test mess in Trains' plots, we modified tensorflow_bind.py to add a different prefix for the "train" and "validation" streams. Trains' bugfix was adding a logdir name (which was not that clear to us).

*This was done 1-2 years ago, so it might be redundant now
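For reference, this is not the tensorflow_bind.py patch itself, but a minimal sketch of the logdir-based separation that fix relies on (TF 2.x summary writers; it assumes Trains' TensorBoard auto-logging uses the logdir name to tell the streams apart):

import tensorflow as tf

# write train and validation scalars to distinct logdirs; the logdir name
# then distinguishes the two streams when the scalars are picked up
train_writer = tf.summary.create_file_writer('logs/train')
val_writer = tf.summary.create_file_writer('logs/validation')

with train_writer.as_default():
    tf.summary.scalar('loss', 0.4, step=1)
with val_writer.as_default():
    tf.summary.scalar('loss', 0.5, step=1)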

Cheers, Dagan

Dagan
  • Hey Dagan:) Nice to hear you had a similar workflow. I ended up using the same task for train/test, and logging things manually in test, so I don't have the naming problem. – Michael Litvin Jul 09 '20 at 08:36