
My requirement is to generate a CSV file with ~500,000 (unique) records that has the following column headers:

CSV file example:
    email,customerId,firstName,lastName
    qa+d43e5efc-6b0f-46ce-a14e-1db63bb77882@example.com,0d981ae1be954ea7-b411-28a98e3ddba2,Daniel,Newton

I tried to write the piece of code below for this, but I wanted to know whether there is a better/more efficient way to do it. It's my first time dealing with a large data set, and currently my code takes a really long time to run [more than an hour], so I'm really looking for suggestions/feedback. Thanks.

    import csv
    from uuid import uuid4

    from faker import Faker

    with open('test_csv_file.csv', 'w') as csvf:
        writer = csv.writer(csvf)
        column_headers = ("email", "customerId", "firstName", "lastName")
        writer.writerow(column_headers)
        for _ in range(500000):
            fake = Faker()  # a new Faker instance is built on every iteration
            row = (f'qa+{uuid4()}@example.com', uuid4(), fake.first_name(), fake.last_name())
            writer.writerow(row)
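
One way to sanity-check the "unique records" requirement afterwards is to read the file back and compare the row count against the number of distinct emails and customer IDs. A minimal sketch (the file and column names are taken from the code above):

    import csv

    # Count data rows and distinct emails / customer IDs in the generated file.
    with open('test_csv_file.csv', newline='') as csvf:
        reader = csv.DictReader(csvf)
        emails, customer_ids, total = set(), set(), 0
        for row in reader:
            total += 1
            emails.add(row['email'])
            customer_ids.add(row['customerId'])

    print(total, len(emails), len(customer_ids))  # all three should match if every record is unique
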
philomath
  • How long does it take if you remove the last line? – Kelly Bundy Aug 11 '21 at 17:22
  • See [How can you profile a Python script?](https://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script), although you may want a [line-profiler](https://pypi.org/project/line-profiler/) for this. – martineau Aug 11 '21 at 17:26
  • Do you really need 500K *different* `Faker`s? – martineau Aug 11 '21 at 17:31
  • Yes I need unique values for each row – philomath Aug 11 '21 at 17:31
  • I can think of an approach where I generate the 500k rows in a separate step and then write them to CSV, but I'm not sure and would need advice. Thanks – philomath Aug 11 '21 at 17:32
  • That's exactly why I was suggesting that you profile your code. Also see [Which is faster?](https://ericlippert.com/2012/12/17/performance-rant/) – martineau Aug 11 '21 at 17:33
  • @martineau Will do. Thanks – philomath Aug 11 '21 at 17:35
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/235902/discussion-between-philomath-and-martineau). – philomath Aug 11 '21 at 17:35
  • See my ending comments about performance in the discussion under [this unrelated question](https://stackoverflow.com/a/64991758/355230). – martineau Aug 11 '21 at 17:46
  • Writing this amount of data to the disk will hugely dominate over any code inefficiencies. You are limited by the disk's bandwidth. – tripleee Aug 11 '21 at 18:05
  • @tripleee So how do I achieve my requirement of creating a csv file with 500k unique records ? – philomath Aug 11 '21 at 18:07
  • Your indentation is clearly incorrect; please [edit] to fix it. On the desktop version of this site, paste your code, then select the pasted block and type ctrl-K to correctly format your code with indentations preserved. – tripleee Aug 11 '21 at 18:07
  • What I'm trying to say is that your code is fine. You might be able to reduce the amount of CPU it requires (we can't see how `Faker` is defined, but maybe it contains something you could optimize?), but writing this amount of data to the disk takes as long as the disk takes to copy this amount of stuff onto the media, regardless of how quick you manage to make your code. Simply copying 500k times your average line length takes time. – tripleee Aug 11 '21 at 18:10
  • @tripleee What's that amount of data that you're talking about? I only see the "500k records", not how large each record is. Am I overlooking something, or were comments deleted? And their answer sounds like they dropped the time from more than an hour to just 10 seconds with different code but presumably on the same disk, so how were they "limited by the disk's bandwidth"? – Kelly Bundy Aug 12 '21 at 11:09
  • Nothing in the code they posted here should take that long. If `Faker` is that inefficient, that might be worth investigating; but that's unreproducible without more details. The calculation 500k times average line (record) length should probably translate to a few dozen megabytes, which is in and of itself not too much by today's standards. – tripleee Aug 12 '21 at 11:18
  • 1) You could run a profiler and see what takes time. 2) You might have a bottleneck since you are constructing a class and formatting the string before writing to the file. 3) Is it saving the file on each line? 4) You can also test how much time it would take if you just write to a normal file instead of using the csv library, `with open("fake_file.txt", "w") as fake_file:` ... one library might be faster than another. Last year I had something similar; I tried to make it faster in Go and ended up being 4 times slower. – pelos Aug 12 '21 at 18:33
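
The comments above boil down to two checks: profile before optimizing, and see whether constructing `Faker()` on every iteration (rather than once) is what costs the time. A minimal sketch of both using the standard library's cProfile; the `write_rows` function name and the smaller 50,000-row trial size are illustrative choices, not anything from the question:

    import cProfile
    import csv
    import pstats
    from uuid import uuid4

    from faker import Faker  # assumes the faker package is installed


    def write_rows(path, count):
        """Write `count` rows of fake data to `path` as CSV."""
        fake = Faker()  # created once here, not once per row
        with open(path, 'w', newline='') as csvf:
            writer = csv.writer(csvf)
            writer.writerow(("email", "customerId", "firstName", "lastName"))
            for _ in range(count):
                writer.writerow((f'qa+{uuid4()}@example.com', uuid4(),
                                 fake.first_name(), fake.last_name()))


    # Profile a smaller run first to see where the time actually goes.
    cProfile.run("write_rows('test_csv_file.csv', 50000)", "profile.out")
    pstats.Stats("profile.out").sort_stats("cumulative").print_stats(10)

The top of the cumulative listing should make it clear whether the Faker calls, the UUID formatting, or the actual disk writes dominate.
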

1 Answer

  • So I was able to generate a 500k record CSV file in ~10 seconds by using the Mimesis library.

  • It might not be the cleanest and best approach, but it solved my problem [open to more feedback].

Updated code:


    import csv

    from mimesis import Person, Numbers
    from mimesis.builtins import USASpecProvider

    person = Person()
    number = Numbers()  # instantiated but not used below
    us = USASpecProvider()
    column_headers = ["email", "customerId", "firstName", "lastName"]

    with open('test_csv_file.csv', 'w') as csvf:
        writer = csv.writer(csvf)
        writer.writerow(column_headers)
        for _ in range(500000):
            row = [person.email(unique=True), us.tracking_number(), person.first_name(), person.last_name()]
            writer.writerow(row)
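
If you want to double-check the ~10 second figure on your own machine, a simple wall-clock measurement around the write loop is usually enough. A minimal sketch built on the code above; the `generate` function name and the `time.perf_counter` timing are additions for illustration, not part of the original answer:

    import csv
    import time

    from mimesis import Person
    from mimesis.builtins import USASpecProvider


    def generate(path, count):
        """Write `count` fake rows to `path` and return the elapsed seconds."""
        person = Person()
        us = USASpecProvider()
        start = time.perf_counter()
        with open(path, 'w', newline='') as csvf:
            writer = csv.writer(csvf)
            writer.writerow(["email", "customerId", "firstName", "lastName"])
            for _ in range(count):
                writer.writerow([person.email(unique=True), us.tracking_number(),
                                 person.first_name(), person.last_name()])
        return time.perf_counter() - start


    print(f"Wrote 500000 rows in {generate('test_csv_file.csv', 500000):.1f} s")
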

philomath