I would like to use the Python Faker library to generate 500 lines of data, but the code below gives me the same data repeated on every row. Can you please point out where I'm going wrong? I believe it has something to do with the for loop. Thanks in advance:

from faker import Factory
import pandas as pd
import random

def create_fake_stuff(fake):
    df = pd.DataFrame(columns=('name'
        , 'email'
        , 'bs'
        , 'address'
        , 'city'
        , 'state'
        , 'date_time'
        , 'paragraph'
        , 'Conrad'
        , 'randomdata'))

    stuff = [fake.name()
        , fake.email()
        , fake.bs()
        , fake.address()
        , fake.city()
        , fake.state()
        , fake.date_time()
        , fake.paragraph()
        , fake.catch_phrase()
        , random.randint(1000,2000)]

    for i in range(10):
        df.loc[i] = [item for item in stuff]
    print(df)

if __name__ == '__main__':
    fake = Factory.create()
    create_fake_stuff(fake)
Conrad Addo

4 Answers

Disclaimer: this answer was added long after the question was asked, and it adds new information that does not directly answer it.

There is now a fast new library: Mimesis, a fake data generator.

  • Upside: it is reported to be many times faster than Faker (see my test below, which uses data similar to that in the question).
  • Downside: it requires Python 3.6 or later.

pip install mimesis

>>> from mimesis import Person
>>> from mimesis.enums import Gender
>>> person = Person('en')

>>> person.full_name(gender=Gender.FEMALE)
'Antonetta Garrison'
>>> personru = Person('ru')
>>> personru.full_name()
'Рената Черкасова'

The same with Faker, which was developed earlier:

pip install faker

>>> from faker import Faker
>>> fake_ru=Faker('ru_RU')
>>> fake_jp=Faker('ja_JP')
>>> print (fake_ru.name())
Субботина Елена Наумовна
>>> print (fake_jp.name())
大垣 花子

Below is my recent timing of Mimesis vs. Faker, based on the code provided in the answer from forzer0eight:

from faker import Faker
import pandas as pd
import random

fake = Faker()

def create_rows_faker(num=1):
    # Build every row as a dict first; the DataFrame is created once from the list
    output = [{"name":fake.name(),
               "address":fake.address(),
               "name":fake.name(),  # duplicate key: this value replaces the "name" above
               "email":fake.email(),
               #"bs":fake.bs(),
               "city":fake.city(),
               "state":fake.state(),
               "date_time":fake.date_time(),
               #"paragraph":fake.paragraph(),
               #"Conrad":fake.catch_phrase(),
               "randomdata":random.randint(1000,2000)} for x in range(num)]
    return output

%%time
df_faker = pd.DataFrame(create_rows_faker(5000))

CPU times: user 3.51 s, sys: 2.86 ms, total: 3.51 s
Wall time: 3.51 s

from mimesis import Person
from mimesis import Address
from mimesis import Datetime
from mimesis.enums import Gender
import pandas as pd
import random

person = Person('en')
address = Address()
datetime = Datetime()

def create_rows_mimesis(num=1):
    # Same columns as the Faker version, generated with Mimesis providers
    output = [{"name":person.full_name(gender=Gender.FEMALE),
               "address":address.address(),
               "name":person.name(),  # duplicate key: this first name replaces the full name above
               "email":person.email(),
               #"bs":person.bs(),
               "city":address.city(),
               "state":address.state(),
               "date_time":datetime.datetime(),
               #"paragraph":person.paragraph(),
               #"Conrad":person.catch_phrase(),
               "randomdata":random.randint(1000,2000)} for x in range(num)]
    return output

%%time
df_mimesis = pd.DataFrame(create_rows_mimesis(5000))

CPU times: user 178 ms, sys: 1.7 ms, total: 180 ms
Wall time: 179 ms

Below is the resulting data for comparison:

df_faker.head(2)
   address                                      city         date_time            email                        name            randomdata  state
0  3818 Goodwin Haven\nBrocktown, GA 06168      Valdezport   2004-10-18 20:35:52  joseph81@gomez-beltran.info  Deborah Garcia  1218        Oklahoma
1  2568 Gonzales Field\nRichardhaven, NC 79149  West Rachel  1985-02-03 00:33:00  lbeck@wang.com               Barbara Pineda  1536        Tennessee

df_mimesis.head(2)
   address             city         date_time                   email                            name      randomdata  state
0  351 Nobles Viaduct  Cedar Falls  2013-08-22 08:20:25.288883  chemotherapeutics1964@gmail.com  Ernest    1673        Georgia
1  517 Williams Hill   Malden       2008-01-26 18:12:01.654995  biochemical1972@yandex.com       Jonathan  1845        North Dakota
FelixEnescu
Alex Martian
  • To be honest, your benchmark is slightly off: if you run both in a loop, you will get very close numbers (in my case, 7 and 8 seconds). – Eugene K Jul 30 '20 at 22:09
  • Mimesis only works with pytz in its datetime providers, so if your project does not have the pytz module, Mimesis is not a good solution. – Dmitriy Lunev Jun 22 '23 at 13:33
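
The benchmarking concern raised in the comments can be checked by timing several repeated runs instead of a single %%time cell. The following is only a minimal sketch using the standard-library timeit module; create_rows here is a stand-in that draws a random integer, and it is an assumption that you would replace its body with the Faker or Mimesis calls you actually want to compare.

```python
import random
import timeit


def create_rows(num=1):
    # Stand-in row generator (assumption): swap the dict contents for
    # Faker or Mimesis calls to benchmark the real libraries.
    return [{"randomdata": random.randint(1000, 2000)} for _ in range(num)]


# Run the 5000-row build five separate times, one call per measurement.
timings = timeit.repeat(lambda: create_rows(5000), number=1, repeat=5)

# The minimum is usually the measurement least disturbed by other processes.
best = min(timings)
print(f"best of {len(timings)} runs: {best:.4f} s")
```

Comparing the best of several runs for each library tends to be less sensitive to one-off effects such as import or start-up cost than a single measurement.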

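The following script builds all the rows as a list of dicts and creates the DataFrame in a single call, which remarkably improves pandas performance compared with adding rows one at a time.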

    from faker import Faker
    import pandas as pd
    import random

    fake = Faker()

    def create_rows(num=1):
        # One dict per row; every fake.* call runs again for each row
        output = [{"name":fake.name(),
                   "address":fake.address(),
                   "name":fake.name(),  # duplicate key: replaces the "name" above
                   "email":fake.email(),
                   "bs":fake.bs(),
                   "address":fake.address(),  # duplicate key: replaces the "address" above
                   "city":fake.city(),
                   "state":fake.state(),
                   "date_time":fake.date_time(),
                   "paragraph":fake.paragraph(),
                   "Conrad":fake.catch_phrase(),
                   "randomdata":random.randint(1000,2000)} for x in range(num)]
        return output

Creating 5000 rows takes 5.55 s:

    %%time
    df = pd.DataFrame(create_rows(5000))

    Wall time: 5.55 s
huang06

I moved the stuff list inside my for loop to achieve the desired result. Previously the list was built once, before the loop, so every row received the same values:

for i in range(10):
    # Rebuild the list on every iteration so each Faker call runs again
    stuff = [fake.name()
        , fake.email()
        , fake.bs()
        , fake.address()
        , fake.city()
        , fake.state()
        , fake.date_time()
        , fake.paragraph()
        , fake.catch_phrase()
        , random.randint(1000, 2000)]
    df.loc[i] = stuff
# Print once, after all rows have been added
print(df)
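
To illustrate why the position of the list matters, here is a minimal sketch that uses only the standard library. The random calls stand in for the Faker calls, and make_row, rows_built_once and rows_built_per_iteration are hypothetical names chosen for this example, not part of Faker or pandas.

```python
import random


def make_row(rng):
    # Stand-in for the list of fake.* calls: each call draws fresh values.
    return [rng.randint(1000, 2000), rng.random()]


def rows_built_once(rng, n):
    # The generator calls run a single time, before the loop,
    # so every row is a copy of the same values (the original problem).
    row = make_row(rng)
    return [list(row) for _ in range(n)]


def rows_built_per_iteration(rng, n):
    # The generator calls run again on every iteration (the fix).
    return [make_row(rng) for _ in range(n)]


rng = random.Random(0)
repeated = rows_built_once(rng, 5)
fresh = rows_built_per_iteration(rng, 5)

print(len({tuple(r) for r in repeated}))  # 1 distinct row
print(len({tuple(r) for r in fresh}))     # number of distinct rows when rebuilt each time
```

A list literal stores the results of the calls, not the calls themselves, so reusing the list reuses the same results.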
Conrad Addo

Using the farsante and mimesis libraries is an easy way to create pandas DataFrames with fake data.

import random
import farsante
from mimesis import Person
from mimesis import Address
from mimesis import Datetime

person = Person()
address = Address()
datetime = Datetime()

def rand_int(min_int, max_int):
    # Return a zero-argument function so it can be passed like the mimesis methods
    def some_rand_int():
        return random.randint(min_int, max_int)
    return some_rand_int

df = farsante.pandas_df([
    person.full_name,
    address.address,
    person.name,
    person.email,
    address.city,
    address.state,
    datetime.datetime,
    rand_int(1000, 2000)], 5)

print(df)
        full_name              address    name  ...          state                   datetime some_rand_int
0   Weldon Durham   1027 Nellie Square   Bruna  ...  West Virginia 2030-06-10 09:21:29.179412          1453
1     Veta Conrad  932 Cragmont Arcade  Betsey  ...           Iowa 2017-08-11 23:50:27.479281          1909
2     Vena Kinney    355 Edgar Highway   Tyson  ...  New Hampshire 2002-12-21 05:26:45.723531          1735
3   Adam Sheppard    270 Williar Court  Treena  ...   North Dakota 2011-03-30 19:16:29.015598          1503
4  Penney Allison     592 Oakdale Road    Chas  ...          Maine 2009-12-14 16:31:37.714933          1175

This approach keeps your code clean.
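
The reason rand_int returns a function rather than a number is that the generators are passed as callables and invoked once per row. The sketch below shows that idea with the standard library only. build_rows is a hypothetical helper written for this example; it is not farsante's implementation, and naming each column after the function's __name__ is an assumption based on the column names in the output above.

```python
import random


def rand_int(min_int, max_int):
    # Closure: remembers the bounds and returns a zero-argument generator.
    def some_rand_int():
        return random.randint(min_int, max_int)
    return some_rand_int


def build_rows(funs, num_rows):
    # Hypothetical helper: call every generator once per row and key
    # each value by the generator's function name.
    return [{f.__name__: f() for f in funs} for _ in range(num_rows)]


rows = build_rows([rand_int(1000, 2000), random.random], 3)
for row in rows:
    print(row)
```

Passing person.full_name without parentheses works the same way: the method itself is handed over and only called when a row is generated.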

Powers