How to prepare data in the input format table and metadata for the Synthetic Data Vault (SDV) library

Question

I want to use the synthetic data generation method of the Synthetic Data Vault (SDV) library (reference: https://sdv.dev/SDV/index.html), but I can't get it to work. I think my problem is preparing the data in the input format required by the ".fit()" method.

The demo code is as follows:

from sdv import SDV, load_demo

metadata, tables = load_demo(metadata=True)

sdv = SDV()
sdv.fit(metadata, tables)

sampled = sdv.sample_all()

The object "metadata" is:

type(metadata) = <class 'sdv.metadata.dataset.Metadata'>

and the object "tables" is a dict of 3 dataframes:

type(tables) = <class 'dict'>
type(tables['users']) = <class 'pandas.core.frame.DataFrame'>

My case study begins with a pandas DataFrame:

df_input = pd.read_csv("file.csv")

so I can instantiate the "table" object as a dict:

table_input={'input':df_input}

but I am not sure how to instantiate the "metadata" object. I have tried:

from sdv import Table
metadata_input=Table(name='input',
                     field_names =df_input.columns.tolist(),
                     field_types = {'ID':'int64',
                                    'Type':'object',
                                    'Air temperature [K]':'float64',
                                    'Rotational speed [rpm]':'int64',
                                    },
                     primary_key = 'ID')

but this didn't work:

sdv.fit(metadata=metadata_input, 
          tables= table_input)

The error is:

TypeError: 'Table' object is not subscriptable

Finally, how should I create the metadata object?

Plamen Valentinov Kolev · Answer 1 · 2022-10-14T20:23:13.743

The SDV example in the README is designed for multi-table datasets. For a single table (which is your case), you can use a GaussianCopula model and skip some of the steps you are currently doing.

Here is an example using your dataframe:

import pandas as pd
from sdv.tabular import GaussianCopula

df_input = pd.read_csv('file.csv')

model = GaussianCopula()
model.fit(df_input)
synthetic_data = model.sample(100) # sample 100 new rows

You can refer to the documentation for advanced usage: https://sdv.dev/SDV/user_guides/single_table/gaussian_copula.html
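If the table has an ID column, as in the question, the model can also be told which column is the primary key. The following is only a sketch: it assumes an SDV release older than 1.0 (where `sdv.tabular` exists), and the helper name `fit_and_sample` is hypothetical, not part of SDV.

```python
def fit_and_sample(df, num_rows, primary_key=None):
    """Fit a GaussianCopula on a single DataFrame and sample synthetic rows.

    Hypothetical helper for illustration; assumes a pre-1.0 SDV release.
    """
    # Imported lazily so this module loads even when SDV is not installed.
    from sdv.tabular import GaussianCopula

    # primary_key names the ID column, so the model treats it as an
    # identifier instead of modelling it as an ordinary numerical column.
    model = GaussianCopula(primary_key=primary_key)
    model.fit(df)
    return model.sample(num_rows)


# Example call (requires pandas and a pre-1.0 SDV):
# import pandas as pd
# synthetic_data = fit_and_sample(pd.read_csv("file.csv"), 100, primary_key="ID")
```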

score 0 · Answer 2 · answered Feb 22 '23 at 14:00

I had a similar problem when creating quality reports with sdmetrics, and I solved it by converting the Table object to a dict. In your case:

sdv.fit(metadata=metadata_input.to_dict(), tables= table_input)

Btw, you can infer the metadata from the dataframe this way:

from sdv import Table
metadata_input=Table()
metadata_input.fit(df_input)  # pass the DataFrame itself, not the dict of tables
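For reference, the dataset-level metadata can also be written by hand as a plain dict. The sketch below assumes the metadata schema of pre-1.0 SDV releases (a top-level "tables" key, with "fields" and "primary_key" per table). The helper `build_metadata` and the dtype mapping `DTYPE_TO_FIELD` are hypothetical names introduced here for illustration, and the mapping covers only a few common dtypes.

```python
# Assumed mapping from pandas dtype strings to pre-1.0 SDV field descriptions.
DTYPE_TO_FIELD = {
    "int64": {"type": "numerical", "subtype": "integer"},
    "float64": {"type": "numerical", "subtype": "float"},
    "object": {"type": "categorical"},
    "bool": {"type": "boolean"},
}


def build_metadata(table_name, dtypes, primary_key=None):
    """Return a {'tables': {table_name: {...}}} dict for a single table.

    dtypes maps column name -> pandas dtype string, for example the result
    of df.dtypes.astype(str).to_dict().
    """
    fields = {}
    for column, dtype in dtypes.items():
        if column == primary_key:
            # The primary key is described as an 'id' field.
            subtype = "integer" if dtype.startswith("int") else "string"
            fields[column] = {"type": "id", "subtype": subtype}
        else:
            # Copy so callers can edit one field without touching the mapping.
            fields[column] = dict(DTYPE_TO_FIELD[dtype])
    table = {"fields": fields}
    if primary_key is not None:
        table["primary_key"] = primary_key
    return {"tables": {table_name: table}}


# Column types taken from the question.
dtypes = {
    "ID": "int64",
    "Type": "object",
    "Air temperature [K]": "float64",
    "Rotational speed [rpm]": "int64",
}
metadata_dict = build_metadata("input", dtypes, primary_key="ID")
```

With a pre-1.0 SDV installed, a dict of this shape is what `Metadata(metadata_dict)` is expected to accept, and the table name ("input" here) should match the key used in the tables dict.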
