2

I am wondering if there are any reasonable ways to generate client datasets for a federated learning simulation using the TFF Core API. The Federated Core tutorial uses the MNIST dataset, with each client holding only one distinct label in its dataset. Since there are only 10 distinct labels, that caps the number of clients at 10. If I want to have more clients, how can I do that? Thanks in advance.

Baymax
  • 41
  • 4

2 Answers

1

If you want to create a dataset from scratch, you can use `tff.simulation.FromTensorSlicesClientData` to convert tensors into a TFF `ClientData` object. You just need to pass a dictionary with client IDs as keys and the corresponding data as values.

import collections

import tensorflow_federated as tff

# Assumes x_train / y_train are already loaded (e.g. from MNIST) and that
# `split` (number of clients) and `image_per_set` (examples per client)
# have been chosen beforehand.
client_train_dataset = collections.OrderedDict()
for i in range(1, split + 1):
    client_name = "client_" + str(i)
    start = image_per_set * (i - 1)
    end = image_per_set * i

    print(f"Adding data from {start} to {end} for client: {client_name}")
    data = collections.OrderedDict(
        (('label', y_train[start:end]), ('pixels', x_train[start:end])))
    client_train_dataset[client_name] = data

train_dataset = tff.simulation.FromTensorSlicesClientData(client_train_dataset)

You can check my complete implementation here, where I have split MNIST into 4 clients.
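The slicing arithmetic above can be checked in isolation with plain Python (no TFF needed); the toy `x_train`/`y_train` below are stand-ins for the real MNIST arrays:

```python
import collections

# Toy stand-ins for the real MNIST arrays.
x_train = list(range(40))               # 40 "images"
y_train = [i % 10 for i in range(40)]   # 40 labels

split = 4                               # number of clients
image_per_set = len(x_train) // split   # 10 examples per client

client_train_dataset = collections.OrderedDict()
for i in range(1, split + 1):
    client_name = "client_" + str(i)
    start = image_per_set * (i - 1)
    end = image_per_set * i
    client_train_dataset[client_name] = collections.OrderedDict(
        (('label', y_train[start:end]), ('pixels', x_train[start:end])))

# Each client ends up with a disjoint, contiguous slice of the data.
```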

Mukul
  • 860
  • 8
  • 19
0

There are preprocessed simulation datasets in TFF that should serve this purpose quite nicely. Take, for example, loading EMNIST, where the images are partitioned by writer (corresponding to user) rather than by label. This can be loaded into a Python runtime rather simply (here creating train data with 100 clients):

source, _ = tff.simulation.datasets.emnist.load_data()

def map_fn(example):
  return {'x': tf.reshape(example['pixels'], [-1]), 'y': example['label']}

def client_data(n):
  ds = source.create_tf_dataset_for_client(source.client_ids[n])
  return ds.repeat(10).map(map_fn).shuffle(500).batch(20)

train_data = [client_data(n) for n in range(100)]
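As a side note, the `map_fn` above flattens each 28×28 EMNIST image into a length-784 vector; the reshape can be sanity-checked with plain NumPy (a sketch with a dummy example, no TFF required):

```python
import numpy as np

# Dummy stand-in for one EMNIST example from the client dataset.
example = {'pixels': np.zeros((28, 28), dtype=np.float32), 'label': 3}

# np.reshape(-1) mirrors tf.reshape(example['pixels'], [-1]).
flat = example['pixels'].reshape(-1)
```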

There are existing datasets partitioned in a similar way for extended MNIST (i.e., including handwritten characters in addition to digits), Shakespeare plays (partitioned by character), and Stack Overflow posts (partitioned by user). Documentation on these datasets can be found here.

If you wish to create your own custom dataset, please see the answer here.

Keith Rush
  • 1,360
  • 7
  • 6
  • Thanks Keith, I just had time to try this method. But it seems `train_data` is a list of batched datasets? Is there any way it can be used directly in `tf.keras.Model.fit()`? – Baymax Nov 03 '19 at 10:54
  • Keras would support feeding these datasets in serially (i.e., iterating over the list and feeding them in one at a time). But any TFF federated computation with a type signature accepting values placed at clients would accept a list representing these clients. E.g., the computation returned by `tff.learning.build_federated_averaging_process` would accept this list of datasets as its client data argument. If you need to check on the value of a type signature, just access the `type_signature` attribute on any `tff.Computation`. Hope this makes sense! – Keith Rush Nov 04 '19 at 23:52