How to create federated dataset from a CSV file?

Question

I have selected this dataset: https://www.kaggle.com/karangadiya/fifa19

Now, I would like to convert this CSV file into the federated dataset to fit in the model.

Tensorflow provided tutorials on federated learning where they have used a pre-defined dataset. However, my question is How can I use this particular dataset for a federated learning scenario?

score 7 · Answer 1 · answered Nov 22 '19 at 06:38

I'll use a different CSV dataset, but this should still address the core of this question, which is how to create a federated dataset from a CSV. Let's also assume that there is a column in that dataset which you would like to represent the client_ids for your data.

import pandas as pd
import tensorflow as tf
import tensorflow_federated as tff

csv_url = "https://docs.google.com/spreadsheets/d/1eJo2yOTVLPjcIbwe8qSQlFNpyMhYj-xVnNVUTAhwfNU/gviz/tq?tqx=out:csv"

df = pd.read_csv(csv_url, na_values=("?",))

client_id_colname = 'native.country' # the column that represents client ID
SHUFFLE_BUFFER = 1000
NUM_EPOCHS = 1

# split client id into train and test clients
client_ids = df[client_id_colname].unique()
train_client_ids = client_ids.sample(frac=0.5).tolist()
test_client_ids = [x for x in client_ids if x not in train_client_ids]

There are a few ways to do this, but the way I'll illustrate here uses tff.simulation.ClientData.from_clients_and_fn, which requires that we write a function that accepts a client_id as input and returns a tf.data.Dataset. We can easily construct this from the dataframe.

def create_tf_dataset_for_client_fn(client_id):
  # a function which takes a client_id and returns a
  # tf.data.Dataset for that client
  client_data = df[df[client_id_colname] == client_id]
  dataset = tf.data.Dataset.from_tensor_slices(client_data.to_dict('list'))
  dataset = dataset.shuffle(SHUFFLE_BUFFER).batch(1).repeat(NUM_EPOCHS)
  return dataset

Now, we can use the function above to create a ConcreteClientData object for our training and test data:

train_data = tff.simulation.ClientData.from_clients_and_fn(
        client_ids=train_client_ids,
        create_tf_dataset_for_client_fn=create_tf_dataset_for_client_fn
    )
test_data = tff.simulation.ClientData.from_clients_and_fn(
        client_ids=test_client_ids,
        create_tf_dataset_for_client_fn=create_tf_dataset_for_client_fn
    )

To see one instance of the dataset, try:

example_dataset = train_data.create_tf_dataset_for_client(
        train_data.client_ids[0]
    )
print(type(example_dataset))
example_element = iter(example_dataset).next()
print(example_element)
# <class 'tensorflow.python.data.ops.dataset_ops.RepeatDataset'>
# {'age': <tf.Tensor: shape=(1,), dtype=int32, numpy=array([37], dtype=int32)>, 'workclass': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Local-gov'], dtype=object)>, ...

Each element of example_dataset is a Python dictionary where the keys are strings representing feature names, and the values are tensors with one batch of those features. Now, you have a federated dataset that can be preprocessed and used for modeling.

Thank you so much. I will surely try it out and let you know. :) — Mahedi Hasan Jisan, Dec 01 '19 at 04:34
--------------------------------------------------------------------------- AttributeError Traceback (most recent call last) in ----> 1 train_client_ids = client_ids.sample(frac=0.5) AttributeError: 'numpy.ndarray' object has no attribute 'sample' — Mahedi Hasan Jisan, Dec 03 '19 at 22:14
I am getting this error in train_client_ids line. Can you give me a solution? — Mahedi Hasan Jisan, Dec 03 '19 at 22:14
The code works perfectly fine with minor editing. To overcome this error on train_client_ids line, replace that line's code with this line(also provide the variable 'number_of_training_clients' a value): train_client_ids = sample(client_ids.tolist(),number_of_training_clients) — aksingh2411, Jul 04 '20 at 12:45
@aksingh2411 it seems that still the `sample` is not defined. can I know how did you run this code? — Reem, Feb 28 '22 at 04:54
This answer is >2yrs old. It may no longer work with the current tensorflow federated API. However, `sample` is a method from `pandas`, and this method still exists; see https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html . — jpgard, Feb 28 '22 at 23:05

score 2 · Answer 2 · answered Dec 22 '19 at 21:04

You can convert your CSV file to federated data by first creating an h5 file from your CSV file.

Background An h5 file is a hierarchal file structure that shows metadata, this works well as the hierarchal structure represents federated user id's very well

When you are creating federated data you are creating using a client data object, client data is implemented using an h5 file,

Federated Source Code : Client Data https://github.com/tensorflow/federated/blob/master/tensorflow_federated/python/simulation/hdf5_client_data.py

Steps

Create your h5 file
In Federated, Experiment create a client data object , and then follow the Image Recognition tutorial on the federated main page

Creating h5 file

with h5py.File("student31.h5", 'a') as hdf:

example = hdf.create_group("examples")
for i in range(0,20):
    # for data in myDataFrame:
    #     localList.append(str(data))
    # print(type(myDataFrame))
    # data.append(myDataFrame)
    exampleGroup = example.create_group(str(i))

    # myClientGroup = hdf.create_group(str(i))
    # d1 = np.random.random(size = (100,33))
    print("printing the type ")
    print(type(train[i][0]))
    exampleGroup.create_dataset('x',data=train[i])
    exampleGroup.create_dataset('y',data=dataY[i])

Federated Client data Instantiation

    myclient = HDF5ClientData("student31.h5")

score 0 · Answer 3 · answered Mar 07 '22 at 00:25

If you're in 2022, try this update code based in @jpgard response

!pip install tensorflow-federated==0.19.0

import pandas as pd
import tensorflow as tf
import tensorflow_federated as tff

csv_url = "https://docs.google.com/spreadsheets/d/1eJo2yOTVLPjcIbwe8qSQlFNpyMhYj-xVnNVUTAhwfNU/gviz/tq?tqx=out:csv"

df = pd.read_csv(csv_url, na_values=("?",))

client_id_colname = 'native.country' # the column that represents client ID
SHUFFLE_BUFFER = 1000
NUM_EPOCHS = 1
# split client id into train and test clients
client_ids = df[client_id_colname].unique()
train_client_ids = pd.DataFrame(client_ids).sample(frac=0.5).values.tolist()
test_client_ids = [x for x in client_ids if x not in train_client_ids]

def create_tf_dataset_for_client_fn(client_id):
  # a function which takes a client_id and returns a
  # tf.data.Dataset for that client
  client_data = df[df[client_id_colname] == client_id[0]]
  dataset = tf.data.Dataset.from_tensor_slices(client_data.fillna('').to_dict("list"))
  dataset = dataset.shuffle(SHUFFLE_BUFFER).batch(1).repeat(NUM_EPOCHS)
  return dataset

train_data = tff.simulation.datasets.ClientData.from_clients_and_fn(
        client_ids=train_client_ids,
        create_tf_dataset_for_client_fn=create_tf_dataset_for_client_fn
    )
test_data = tff.simulation.datasets.ClientData.from_clients_and_fn(
        client_ids=test_client_ids,
        create_tf_dataset_for_client_fn=create_tf_dataset_for_client_fn
    )
example_dataset = train_data.create_tf_dataset_for_client(
        train_data.client_ids[0]
    )

print(type(example_dataset))
example_element = iter(example_dataset).next()
print(example_element)
# <class 'tensorflow.python.data.ops.dataset_ops.RepeatDataset'>
# {'age': <tf.Tensor: shape=(1,), dtype=int32, numpy=array([37], dtype=int32)>, 'workclass': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Local-gov'], dtype=object)>, ...

How to create federated dataset from a CSV file?

3 Answers3

Linked