Imagine the filesystem of a disk drive. The contents of the files are stored on the disk, but it is the index and the pointers to that data that we call the filesystem. That metadata is what brings value to the user, who can find the relevant data when they need it by searching or navigating through the filesystem.
MLMD is similar: it stores the metadata of an ML pipeline, such as which hyperparameters you used in an execution, which version of the training data, what the distribution of the features was, and so on. But it is more than just a registry of runs. This metadata can power two killer features of an ML pipeline tool:
- asynchronous execution of its components, for example retraining a model when there is new data without necessarily generating a new vocabulary
- reuse of results from previous runs, or step-level output caching. For example, do not run a step if its input parameters haven't changed; instead, reuse the output of a previous run from the cache to feed the next component.
So yes, the actual data is indeed stored in a storage system, maybe a cloud bucket, in the form of Parquet files across transformations, or model files and schema protobufs. MLMD stores the URI of this data along with some meta information. For example, a SavedModel is stored in `s3://mymodels/1`, and it has an entry in the `Artifacts` table of MLMD, with a relation to the `Trainer` run and its `TrainArgs` parameters in the `ContextProperty` table.
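To make the pointer idea concrete, here is a minimal sketch using Python's built-in `sqlite3`. The table and column names are simplified stand-ins chosen for illustration and are not MLMD's actual schema; the `num_steps` value is made up. The point is only that the store holds the URI and the relations, never the model bytes themselves.

```python
import sqlite3

# Hypothetical, simplified schema for illustration only (NOT MLMD's real tables).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE artifact (id INTEGER PRIMARY KEY, type TEXT, uri TEXT);
    CREATE TABLE execution (id INTEGER PRIMARY KEY, component TEXT);
    CREATE TABLE execution_property (execution_id INTEGER, name TEXT, value TEXT);
    CREATE TABLE event (execution_id INTEGER, artifact_id INTEGER, direction TEXT);
""")

# Record one Trainer run together with a parameter it was given (value is made up).
run_id = conn.execute(
    "INSERT INTO execution (component) VALUES ('Trainer')"
).lastrowid
conn.execute(
    "INSERT INTO execution_property VALUES (?, 'num_steps', '1000')", (run_id,)
)

# Record only a pointer to the model; the model files stay in the bucket.
model_id = conn.execute(
    "INSERT INTO artifact (type, uri) VALUES ('SavedModel', 's3://mymodels/1')"
).lastrowid
conn.execute("INSERT INTO event VALUES (?, ?, 'OUTPUT')", (run_id, model_id))


def outputs_of(component):
    """Return the URIs of all artifacts produced by runs of a component."""
    rows = conn.execute(
        """
        SELECT a.uri
        FROM artifact a
        JOIN event ev ON ev.artifact_id = a.id
        JOIN execution e ON e.id = ev.execution_id
        WHERE e.component = ? AND ev.direction = 'OUTPUT'
        """,
        (component,),
    ).fetchall()
    return [row[0] for row in rows]
```

Calling `outputs_of("Trainer")` walks from the run to its output artifact and returns the stored URI, which is the kind of lookup a downstream component would do to find its input.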
> If not, what does it mean by results of pipeline components?
It means the pointers to the data generated by a component run, together with the input parameters. In our previous example, if neither the input data nor the `TrainArgs` of a `Trainer` component have changed in a run, the pipeline shouldn't run that expensive component again, but should reuse the model file from the cache.
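The caching decision described here can be sketched in plain Python. This is an illustration of the idea, not how MLMD or any particular orchestrator implements it: `StepCache`, `cache_key`, `fake_trainer`, and the `s3://mydata/v1` URI are all hypothetical names, and the sketch assumes that an unchanged input URI means unchanged input data.

```python
import hashlib
import json


def cache_key(component, input_uris, params):
    """Fingerprint a step by its name, its input pointers, and its parameters."""
    payload = json.dumps(
        {"component": component, "inputs": sorted(input_uris), "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class StepCache:
    """Hypothetical in-memory stand-in for a metadata store used for caching."""

    def __init__(self):
        self._outputs = {}  # cache key -> output URI of a previous run

    def run(self, component, input_uris, params, execute):
        """Return (output_uri, was_cached); call execute only on a cache miss."""
        key = cache_key(component, input_uris, params)
        if key in self._outputs:
            return self._outputs[key], True
        output_uri = execute(input_uris, params)
        self._outputs[key] = output_uri
        return output_uri, False


calls = []


def fake_trainer(input_uris, params):
    """Pretend to train; only records the call and returns a new model URI."""
    calls.append(params)
    return "s3://mymodels/" + str(len(calls))


cache = StepCache()
# First run: nothing cached yet, so the trainer executes.
first = cache.run("Trainer", ["s3://mydata/v1"], {"num_steps": 1000}, fake_trainer)
# Same inputs and parameters: the earlier output pointer is reused.
second = cache.run("Trainer", ["s3://mydata/v1"], {"num_steps": 1000}, fake_trainer)
# Changed parameters: the key differs, so the trainer executes again.
third = cache.run("Trainer", ["s3://mydata/v1"], {"num_steps": 2000}, fake_trainer)
```

Only the first and third calls reach `fake_trainer`; the second one gets back the pointer recorded by the first run, which is all the next component needs.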
This requirement of a continuous ML pipeline makes workflow managers such as `Tekton` or `Argo` more relevant than `Airflow`, and makes MLMD a more focused metadata store than the latter.