I'm trying to add a Faker data type to a column in an SDV model.

Imports:

from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer
import faker

Code:

fake = faker.Faker()

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=df)

metadata.update_column(
    column_name='DR_Prod',
    sdtype='fake.company'
)

I also tried 'faker.providers.company', but every time I get an error (the kernel crashes).

After the metadata step, I run this code:

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(df)
synthetic_data = synthesizer.sample(num_rows=len(df))

The code runs if I leave out metadata.update_column, but then I don't get the result I need.

SDV version: 1.2.0

Thanks.

John Doe

2 Answers

I think this will give you some insight:

  1. Add the Faker data type to the model.

    model.add_field(name="name", type="faker.name")

Change sdtype to type.

  2. Update the metadata for the model.

    model = sdv.SDV()

    model.add_field(name="name", type="faker.name")

    model.update_metadata()

    data = model.generate_data(10)

    for row in data: print(row["name"])

Dejene T.

I've figured out how to solve this problem. Here's the updated code:

import pandas as pd
from faker import Faker
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# Faker instance for generating fake first names (British English locale)
fake = Faker(['en_GB'])

# build a list of fake names
names_list = []

for _ in range(5):
    names_list.append(
        fake.first_name()
    )

# build a dataframe from the fake names
df = pd.DataFrame(data=names_list)
df = df.rename(columns={0: 'Names'}).reset_index()

# detect the metadata from df
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=df)

# seed Faker's shared random generator (intended to make the generated names reproducible)
Faker.seed(0)

# update the metadata:
# set sdtype to the bare name of a Faker provider function,
# for example 'company', 'first_name' or 'job'
metadata.update_column(
    column_name='Names',
    sdtype='first_name'
)

# create the model with the language needed (check the localized providers on the Faker site)
synthesizer = GaussianCopulaSynthesizer(metadata, locales=['en_GB'])

# fit the model
synthesizer.fit(df)
# generate the synthetic dataframe
synthetic_data = synthesizer.sample(num_rows=len(df))

Here are the outputs (df first, then synthetic_data):

   index     Names
0      0     Emily
1      1     Simon
2      2     Garry
3      3    Denise
4      4  Danielle

   index     Names
0      1     Aaron
1      3   Gillian
2      1    Sheila
3      2     Julia
4      1  Caroline
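
The key difference from the question's attempt is that sdtype here is the bare Faker function name ('first_name'), not a dotted path like 'fake.company' or 'faker.providers.company'. As a rough sketch of that conversion, the helper below keeps only the last dotted component. Note that to_faker_sdtype is a hypothetical name made up for illustration; it is not part of SDV or Faker, and it does not check that the resulting name really exists as a Faker function.

```python
def to_faker_sdtype(name: str) -> str:
    """Reduce a dotted path such as 'fake.company' to the bare name 'company'.

    Hypothetical helper for illustration only; not part of SDV or Faker.
    It does not verify that the result is a real Faker provider function.
    """
    cleaned = name.strip()
    if not cleaned:
        raise ValueError("sdtype must be a non-empty string")

    # Keep only the last dotted component:
    # 'faker.providers.company' -> 'company'
    bare = cleaned.rsplit(".", 1)[-1]

    # Tolerate a trailing call such as 'fake.company()'
    if bare.endswith("()"):
        bare = bare[:-2]

    # The result should look like a plain Python identifier
    if not bare.isidentifier():
        raise ValueError(f"cannot derive a function name from {name!r}")
    return bare


if __name__ == "__main__":
    for candidate in ("fake.company", "faker.providers.company", "first_name"):
        print(candidate, "->", to_faker_sdtype(candidate))
```

The returned string could then be passed as the sdtype argument of metadata.update_column, as in the code above. It's still worth checking the name against the Faker provider list, since the last component of a module path won't always be a function name.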
John Doe