
I just started using tensorflow-datasets and I'm a bit puzzled. I spent almost an hour googling it and I still cannot find a way to get feature names in a dataframe. I guess I'm missing something obvious.

import tensorflow_datasets as tfds

ds, ds_info = tfds.load('iris', split='train',
                        shuffle_files=True, with_info=True)
tfds.as_dataframe(ds.take(10), ds_info)

I'd like to know which feature is what: sepal_length, sepal_width, petal_length, petal_width. But I'm stuck with a single ndarray.

I can get class names:

ds_info.features["label"].names

is giving me: ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], but

ds_info.features["features"]

gives me no names, only the type: Tensor(shape=(4,), dtype=float32)

In summary, my question: is there a way to label the contents of the input ndarray with feature names like "sepal_length", "sepal_width", "petal_length", "petal_width"?

Cyrille
  • It probably wouldn't make much sense to try to relate each value to a specific measurement (length, width). The information is encoded. If for some reason you have to know this, use https://www.kaggle.com/datasets/uciml/iris – Innat Apr 28 '23 at 12:56

1 Answer


Here is some starter code for the iris data using the TensorFlow Datasets API.

import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow import keras

ds, ds_info = tfds.load(
    'iris', split='train',
    shuffle_files=True, with_info=True
)
ds = ds.map(lambda x : (x['features'], x['label']))
x, y = next(iter(ds))
x.shape, y.shape
# (TensorShape([4]), TensorShape([]))
ds = ds.batch(32, drop_remainder=True)
ds = ds.prefetch(tf.data.AUTOTUNE)
x, y = next(iter(ds))
x.shape, y.shape
# (TensorShape([32, 4]), TensorShape([32]))
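
The `features` tensor carries no column names, so any naming has to be applied by hand. Here is a minimal sketch of that, assuming the four values follow the column order of the original UCI iris.data file (sepal length, sepal width, petal length, petal width); that order is an assumption and is not exposed by `ds_info`. The helper `split_feature_column` is hypothetical, and the small `demo` frame only stands in for what `tfds.as_dataframe(ds.take(10), ds_info)` is expected to return (a `features` column of length-4 arrays plus a `label` column).

```python
import numpy as np
import pandas as pd

# ASSUMPTION: column order follows the UCI iris.data file.
# TFDS itself does not provide these names.
FEATURE_NAMES = ["sepal_length", "sepal_width", "petal_length", "petal_width"]


def split_feature_column(df, column="features", names=FEATURE_NAMES):
    """Expand a column of fixed-length arrays into one named column per entry."""
    stacked = np.stack(df[column].to_numpy())  # shape: (n_rows, n_features)
    if stacked.shape[1] != len(names):
        raise ValueError(
            f"expected {len(names)} values per row, got {stacked.shape[1]}"
        )
    expanded = pd.DataFrame(stacked, columns=list(names), index=df.index)
    # Keep the remaining columns (e.g. 'label') to the right of the named ones.
    return pd.concat([expanded, df.drop(columns=[column])], axis=1)


# Stand-in for tfds.as_dataframe(...); the rows are illustrative only.
demo = pd.DataFrame({
    "features": [
        np.array([5.1, 3.5, 1.4, 0.2], dtype=np.float32),
        np.array([6.3, 3.3, 6.0, 2.5], dtype=np.float32),
    ],
    "label": [0, 2],
})

named = split_feature_column(demo)
print(named)
```

With the real dataset, the same helper could be called on the frame returned by `tfds.as_dataframe`. It is worth checking a few rows against the UCI file before relying on the names.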

Building a dummy model to train.

model = keras.Sequential([
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dense(64, activation='relu'),

    # 3 classes: https://www.tensorflow.org/datasets/catalog/iris
    keras.layers.Dense(3, activation='softmax'),
])

model.compile(
    loss='sparse_categorical_crossentropy', 
    optimizer='adam', 
    metrics=['accuracy']
)

history = model.fit(
    ds, 
)
# 3s 9ms/step - loss: 1.0857 - accuracy: 0.3047
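
Since the final layer outputs one softmax probability per class, predictions can be mapped back to the class names from `ds_info.features["label"].names`. This is a rough sketch, not part of the training code above: `decode_predictions` is a hypothetical helper, and `fake_probs` is made-up data standing in for the output of `model.predict(x)`.

```python
import numpy as np

# Class names as reported by ds_info.features["label"].names in the question.
CLASS_NAMES = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]


def decode_predictions(probs, class_names=CLASS_NAMES):
    """Return the name of the most probable class for each row of probabilities."""
    probs = np.asarray(probs)
    return [class_names[i] for i in probs.argmax(axis=-1)]


# Hypothetical softmax outputs for two samples (not real model output).
fake_probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
])

labels = decode_predictions(fake_probs)
print(labels)
```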


Innat