I am testing some algorithms in TensorFlow Federated (TFF). In this regard, I would like to test and compare them on the same federated dataset with different "levels" of data heterogeneity, i.e. non-IIDness.
Hence, I would like to know whether there is any way to control and tune the "level" of non-IIDness in a specific federated dataset, in an automatic or semi-automatic fashion, e.g. by means of TFF APIs or just traditional TF API (maybe inside the Dataset utils).
To be more practical: for instance, the EMNIST federated dataset provided by TFF has 3383 clients with each one of them having their handwritten characters. However, these local dataset seems to be quite balanced in terms of number of local examples and in terms of represented classes (all classes are, more or less, represented locally). If I would like to have a federated dataset (e.g., starting by the TFF's EMNIST one) that is:
- Patologically non-IID, for example having clients that hold only one class out of N classes (always referring to a classification task). Is this the purpose of
tff.simulation.datasets.build_single_label_dataset
documentation here. If so, how should I use it from a federated dataset such as the ones already provided by TFF?; - Unbalanced in terms of the amount of local examples (e.g., one client has 10 examples, another one has 100 examples);
- Both the possibilities;
how should I proceed inside the TFF framework to prepare a federated dataset with those characteristics?
Should I do all the stuff by hand? Or do some of you have some advices to automate this process?
An additional question: in this paper "Measuring the Effects of Non-Identical Data Distribution for Federated Visual Classification", by Hsu et al., they exploit the Dirichlet distribution to synthesize a population of non-identical clients, and they use a concentration parameter to control the identicalness among clients. This seems an wasy-to-tune way to produce datasets with different levels of heterogeneity. Any advice about how to implement this strategy (or a similar one) inside the TFF framework, or just in TensorFlow (Python) considering a simple dataset such as the EMNIST, would be very useful too.
Thank you a lot.