Data stored in MLMD in TensorFlow TFX

Question

As far as I understand, TensorFlow uses MLMD to record and retrieve metadata associated with workflows. This may include:

1. results of pipeline components
2. metadata about artifacts generated through the components of the pipelines
3. metadata about executions of these components
4. metadata about the pipeline and associated lineage information

Features:

Does the above (e.g. #1 aka "results of components") imply that MLMD stores actual data? (e.g. input features for ML training?). If not, what does it mean by results of pipeline components?

Orchestration and pipeline history:

Also, when using TFX with e.g. Airflow, which has its own metastore (holding metadata about DAGs, their runs, and other Airflow configuration such as users, roles, and connections), does MLMD store redundant information? Does it supersede it?

score 1 · Answer 1 · answered Nov 05 '21 at 15:05

Imagine the filesystem of a disk drive. The contents of the files are stored on the disk, but it is the index and the pointers to that data that we call the filesystem. That metadata is what brings value to users, who can find the relevant data when they need it by searching or navigating through the filesystem.

Similarly, MLMD stores the metadata of an ML pipeline: which hyperparameters you used in an execution, which version of the training data, what the distribution of the features looked like, and so on. But it is more than just a registry of runs. This metadata can power two killer features of an ML pipeline tool:

1. Asynchronous execution of its components, for example retraining a model when there is new data without necessarily generating a new vocabulary.
2. Reuse of results from previous runs, or step-level output caching. For example, do not run a step if its input parameters haven't changed; instead, reuse the output of a previous run from the cache to feed the next component.

So yes, the actual data is stored, but in separate storage, maybe a cloud bucket, in the form of Parquet files across transformations, model files, and schema protobufs. MLMD stores only the URI of that data, along with some meta information. For example, a SavedModel is stored at s3://mymodels/1 and has an entry in the Artifacts table of MLMD, with a relation to the Trainer run and its TrainArgs parameters in the ContextProperty table.
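To make the "pointer, not payload" idea concrete, here is a minimal sketch using only Python's standard library. It is not MLMD's real schema or API: the table names, the `record_run` and `lineage` helpers, and the `s3://mydata/transformed/1` path are all made up for illustration. The point is only that the store holds URIs, parameters, and relations, while the model file itself stays in the bucket.

```python
import json
import sqlite3


def build_toy_store():
    """Create an in-memory database with three illustrative tables.

    This is a toy layout, NOT the actual MLMD schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE artifact (id INTEGER PRIMARY KEY, type TEXT, uri TEXT);
        CREATE TABLE execution (id INTEGER PRIMARY KEY, component TEXT, params TEXT);
        CREATE TABLE event (execution_id INTEGER, artifact_id INTEGER, kind TEXT);
        """
    )
    return conn


def record_run(conn, component, params, inputs, outputs):
    """Record one component run; inputs/outputs are lists of (type, uri) pairs."""
    cur = conn.execute(
        "INSERT INTO execution (component, params) VALUES (?, ?)",
        (component, json.dumps(params, sort_keys=True)),
    )
    execution_id = cur.lastrowid
    for kind, artifacts in (("INPUT", inputs), ("OUTPUT", outputs)):
        for artifact_type, uri in artifacts:
            # Only the URI is stored; the bytes behind it never enter the store.
            art = conn.execute(
                "INSERT INTO artifact (type, uri) VALUES (?, ?)",
                (artifact_type, uri),
            )
            conn.execute(
                "INSERT INTO event VALUES (?, ?, ?)",
                (execution_id, art.lastrowid, kind),
            )
    return execution_id


def lineage(conn, output_uri):
    """Return (component, params) of the run that produced the given URI."""
    row = conn.execute(
        """
        SELECT e.component, e.params
        FROM execution e
        JOIN event ev ON ev.execution_id = e.id
        JOIN artifact a ON a.id = ev.artifact_id
        WHERE a.uri = ? AND ev.kind = 'OUTPUT'
        """,
        (output_uri,),
    ).fetchone()
    return (row[0], json.loads(row[1])) if row else None


store = build_toy_store()
record_run(
    store,
    component="Trainer",
    params={"num_steps": 1000},  # stand-in for TrainArgs
    inputs=[("Examples", "s3://mydata/transformed/1")],  # hypothetical path
    outputs=[("Model", "s3://mymodels/1")],
)
result = lineage(store, "s3://mymodels/1")
print(result)
```

Asking the store about `s3://mymodels/1` gives back the component and parameters that produced it, which is the kind of lineage question a metadata store is meant to answer.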

If not, what does it mean by results of pipeline components?

It means the pointers to the data generated by a component run, together with its input parameters. In our previous example, if neither the input data nor the TrainArgs of a Trainer component have changed in a run, the pipeline shouldn't run that expensive component again, but should reuse the model file from the cache.
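The caching decision described above can be sketched roughly as follows. This is a simplified illustration, not how TFX implements its cache: `cache_key` and `ToyCache` are hypothetical names, and the sketch assumes that a new version of the input data always gets a new URI, so comparing URIs is enough to detect a change.

```python
import hashlib
import json


def cache_key(component, input_uris, params):
    """Fingerprint a step from its name, input pointers, and parameters."""
    payload = json.dumps(
        {"component": component, "inputs": sorted(input_uris), "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ToyCache:
    """Maps a step fingerprint to the URI of the output it produced."""

    def __init__(self):
        self._outputs = {}
        self.runs = 0  # counts how many times a step actually executed

    def run(self, component, input_uris, params, execute):
        key = cache_key(component, input_uris, params)
        if key in self._outputs:
            # Same inputs and parameters as a previous run: reuse its output.
            return self._outputs[key]
        self.runs += 1
        output_uri = execute()  # the expensive step, returns an output URI
        self._outputs[key] = output_uri
        return output_uri


cache = ToyCache()
data = ["s3://mydata/transformed/1"]  # hypothetical input location

first = cache.run("Trainer", data, {"num_steps": 1000}, lambda: "s3://mymodels/1")
second = cache.run("Trainer", data, {"num_steps": 1000}, lambda: "s3://mymodels/1")
third = cache.run("Trainer", data, {"num_steps": 2000}, lambda: "s3://mymodels/2")
print(first, second, third, cache.runs)
```

The second call is served from the cache because nothing changed, while the third call runs the trainer again because the parameters differ.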

This requirement of a continuous ML pipeline makes workflow managers such as Tekton or Argo more relevant than Airflow, and makes MLMD a more focused metadata store than the latter's.

Question about redundancy. Can multiple MLMD servers be setup on cloud all reading and writing to the same data? That is, if one goes down, we wouldn't have to wait to bring it up halting other workflows? If so, is there any documentation on this? — dustin, May 24 '22 at 06:37

score 0 · Answer 2 · answered Nov 19 '20 at 10:14

TFX is an ML pipeline/workflow framework, so when you write a TFX application you are essentially constructing the structure of the workflow and preparing it to accept a particular set of data and process or use it (transformations, model building, inference, deployment, etc.). In that respect it never stores the actual data; it stores the information (metadata) needed to process or use the data. For example, when it checks for anomalies, it needs to remember the previous data schema and statistics (not the actual data), so it saves that information as metadata in MLMD, alongside the actual run metadata.

Airflow will also save its run metadata. This can be seen as a subset of all the metadata, very limited in comparison to what is saved in MLMD, so there will be some redundancy. TFX is the controller: it defines and makes use of the underlying Airflow orchestration. MLMD will not supersede Airflow's metadata, but the run will definitely fail if there is a clash.
