
(I have also posted this question at https://github.com/tensorflow/federated/issues/793.)

I have adapted my own data and model to the federated interfaces, and training converged. However, in an image classification task the whole dataset is extremely large: it cannot be stored in a single federated_train_data, nor loaded into memory all at once. So I need to load the dataset from disk into memory in batches on the fly and use Keras model.fit_generator instead of model.fit during training, which is the usual approach for large datasets.

I suppose that in the iterative_process shown in the image classification tutorial, the model is fitted on a fixed set of data. Is there any way to adjust the code so that it fits to a data generator? I have looked into the source code but am still quite confused. I would be incredibly grateful for any hints.

Eduardo Yáñez Parareda
miaoz18

1 Answer


Generally, TFF considers the feeding of data to be part of the "Python driver loop", which is a helpful distinction to make when writing TFF code.

In fact, when writing TFF, there are generally three levels at which one may be writing:

  1. TensorFlow defining local processing (i.e., processing that will happen on the clients, or on the server, or in the aggregators, or at any other placement one may want, but only at a single placement).
  2. Native TFF defining the way data is communicated across placements. For example, writing tff.federated_sum inside a function decorated with tff.federated_computation; writing this line declares "this data is moved from clients to server, and aggregated via the sum operator".
  3. Python "driving" the TFF loop, e.g. running a single round. It is the job of this final level to do what a "real" federated learning runtime would do; one example here would be selecting the clients for a given round.

If this breakdown is kept in mind, using a generator or some other lazy-evaluation-style construct to feed data in to a federated computation becomes relatively simple; it is just done at the Python level.

One way this could be done is via the create_tf_dataset_for_client method on the ClientData object; as you loop over rounds, your Python code can select from the list of client_ids, then you can instantiate a new list of tf.data.Datasets and pass them in as your new set of client data. An example of this relatively simple usage would be here, and a more advanced usage (involving defining a custom client_datasets_fn which takes client_id as a parameter, and passing it to a separately-defined training loop) would be here, in the code associated to this paper.
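As a rough illustration of that Python-level loop, here is a minimal sketch in plain Python with no TFF dependency. All names in it (make_client_batches, sample_clients, run_rounds, count_examples) are hypothetical stand-ins invented for this example. In real TFF code, the per-client source would presumably come from create_tf_dataset_for_client, and the per-round callback would be a call such as iterative_process.next.

```python
import random
from typing import Callable, Iterator, List, Sequence


def make_client_batches(client_id: str, num_batches: int = 3,
                        batch_size: int = 2) -> Iterator[List[str]]:
    """Hypothetical stand-in for lazily reading one client's data in batches.

    Nothing is produced until the returned generator is iterated.
    """
    for b in range(num_batches):
        yield [f"{client_id}/example_{b * batch_size + i}"
               for i in range(batch_size)]


def sample_clients(client_ids: Sequence[str], clients_per_round: int,
                   rng: random.Random) -> List[str]:
    """Pick the clients that participate in one round (no replacement)."""
    return rng.sample(list(client_ids), clients_per_round)


def run_rounds(client_ids: Sequence[str], num_rounds: int,
               clients_per_round: int,
               run_one_round: Callable[[int, list], int],
               seed: int = 0) -> List[int]:
    """Python 'driver loop': sample clients, build lazy data, run a round."""
    rng = random.Random(seed)
    results = []
    for round_num in range(num_rounds):
        sampled = sample_clients(client_ids, clients_per_round, rng)
        # Lazy sources are created only for this round's sampled clients.
        client_data = [make_client_batches(cid) for cid in sampled]
        results.append(run_one_round(round_num, client_data))
    return results


def count_examples(round_num: int, client_data: list) -> int:
    """Hypothetical stand-in for one federated round; it only consumes data."""
    return sum(len(batch) for batches in client_data for batch in batches)


examples_per_round = run_rounds(["a", "b", "c", "d"], num_rounds=3,
                                clients_per_round=2,
                                run_one_round=count_examples)
```

The point of the sketch is that only the sampled clients' data sources exist in any given round, and each one is consumed batch by batch inside the round rather than being materialized up front.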

One final note: instantiating a tf.data.Dataset does not actually load the dataset into memory; the dataset is only loaded in when it is iterated over. One helpful tip I have received from the lead author of tf.data.Dataset is to think of tf.data.Dataset more as a "dataset recipe" than a literal instantiation of the dataset itself. It has been suggested that perhaps a better name would have been DataSource for this construct; hopefully that may help the mental model on what is actually happening. Similarly, using the tff.simulation.ClientData object generally shouldn't really load anything into memory until it is iterated over in training on the clients; this should make some nuances around managing dataset memory simpler.
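The "recipe" behaviour can be checked with a small sketch, assuming TensorFlow 2.4 or later (for the output_signature argument of tf.data.Dataset.from_generator). The generator read_batches_from_disk, the batches_read list, and the tensor shapes are hypothetical stand-ins for a real on-disk loader, not part of any TFF API.

```python
import numpy as np
import tensorflow as tf

batches_read = []  # records every batch the generator has actually produced


def read_batches_from_disk():
    """Hypothetical stand-in for a loader that reads one image batch per step."""
    for i in range(4):
        batches_read.append(i)
        images = np.full((2, 8, 8, 3), i, dtype=np.float32)
        labels = np.full((2,), i, dtype=np.int32)
        yield images, labels


# Building the dataset only records the recipe; the generator is not run yet.
dataset = tf.data.Dataset.from_generator(
    read_batches_from_disk,
    output_signature=(
        tf.TensorSpec(shape=(None, 8, 8, 3), dtype=tf.float32),
        tf.TensorSpec(shape=(None,), dtype=tf.int32),
    ),
)
reads_after_construction = len(batches_read)

# Iterating is what pulls batches through the generator, one at a time.
shapes = [tuple(images.shape) for images, labels in dataset]
```

Here reads_after_construction stays at zero because nothing is read until the loop iterates the dataset. A dataset built this way is still an ordinary tf.data.Dataset, so it should be usable wherever a per-client dataset is expected.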

Keith Rush
  • Thanks for your detailed explanation; I now understand how to sample client ids. The problem I face in using a generator is that `tf.data.Dataset.from_generator` reads data from disk batch by batch. I want to use it, but I found that the client data generated in each round by `create_tf_dataset_for_client` is the whole dataset (or dataset recipe) for that round. If I want to use a batch data generator, I may need to change the model fit function to let it drive the batch iterator/generator, but I can't touch the client model fit function. – miaoz18 Jan 25 '20 at 09:18
  • Hello @miaoz18 I am working on the same problem. Did you manage to find a solution (using the batch generator for TFL)? – Arman Sep 22 '21 at 23:08
  • Hi @miaoz18 . Did you find a solution to your question? I am having the same issue wherein I can't afford to keep all the training client data in a single array since the data is huge. – ChaoS Adm Jun 12 '22 at 18:13