
My requirement is to generate a CSV file with ~500,000 (unique) records that has the following column headers:

CSV file example:
    email,customerId,firstName,lastName
    qa+d43e5efc-6b0f-46ce-a14e-1db63bb77882@example.com,0d981ae1be954ea7-b411-28a98e3ddba2,Daniel,Newton

I tried to write the piece of code below for this, but I wanted to know whether there is a better/more efficient way to do it. It's my first time dealing with a large data set, and currently my code takes a really long time to run [more than an hour], so I'm really looking for suggestions/feedback. Thanks.

    import csv
    from uuid import uuid4

    from faker import Faker

    with open('test_csv_file.csv', 'w') as csvf:
        writer = csv.writer(csvf)
        column_headers = ("email", "customerId", "firstName", "lastName")
        writer.writerow(column_headers)
        for _ in range(500000):
            fake = Faker()  # a new Faker instance is built on every iteration
            row = (f'qa+{uuid4()}@example.com', uuid4(), fake.first_name(), fake.last_name())
            writer.writerow(row)
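
One way to sanity-check the "unique records" requirement afterwards is to read the file back and compare the row count against the number of distinct emails and customer IDs. A minimal sketch (the file and column names are taken from the code above):

    import csv

    # Count data rows and distinct emails / customer IDs in the generated file.
    with open('test_csv_file.csv', newline='') as csvf:
        reader = csv.DictReader(csvf)
        emails, customer_ids, total = set(), set(), 0
        for row in reader:
            total += 1
            emails.add(row['email'])
            customer_ids.add(row['customerId'])

    print(total, len(emails), len(customer_ids))  # all three should match if every record is unique
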
philomath
  • How long does it take if you remove the last line? – Kelly Bundy Aug 11 '21 at 17:22
  • See [How can you profile a Python script?](https://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script), although you may want a [line-profiler](https://pypi.org/project/line-profiler/) for this. – martineau Aug 11 '21 at 17:26
  • Do you really need 500K *different* `Faker`s? – martineau Aug 11 '21 at 17:31
  • Yes I need unique values for each row – philomath Aug 11 '21 at 17:31
  • I can think of an approach where I generate the 500k rows in a separate step and then write them to CSV, but I'm not sure and would need advice. Thanks – philomath Aug 11 '21 at 17:32
  • That's exactly why I was suggesting that you profile your code. Also see [Which is faster?](https://ericlippert.com/2012/12/17/performance-rant/) – martineau Aug 11 '21 at 17:33
  • @martineau Will do. Thanks – philomath Aug 11 '21 at 17:35
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/235902/discussion-between-philomath-and-martineau). – philomath Aug 11 '21 at 17:35
  • See my ending comments about performance in the discussion under [this unrelated question](https://stackoverflow.com/a/64991758/355230). – martineau Aug 11 '21 at 17:46
  • Writing this amount of data to the disk will hugely dominate over any code inefficiencies. You are limited by the disk's bandwidth. – tripleee Aug 11 '21 at 18:05
  • @tripleee So how do I achieve my requirement of creating a csv file with 500k unique records ? – philomath Aug 11 '21 at 18:07
  • Your indentation is clearly incorrect; please [edit] to fix it. On the desktop version of this site, paste your code, then select the pasted block and type ctrl-K to correctly format your code with indentations preserved. – tripleee Aug 11 '21 at 18:07
  • What I'm trying to say is that your code is fine. You might be able to reduce the amount of CPU it requires (we can't see how `Faker` is defined, but maybe it contains something you could optimize?), but writing this amount of data to the disk takes as long as the disk takes to copy this amount of stuff onto the media, regardless of how quick you manage to make your code. Simply copying 500k times your average line length takes time. – tripleee Aug 11 '21 at 18:10
  • @tripleee What's that amount of data that you're talking about? I only see the "500k records", not how large each record is. Am I overlooking something, or were comments deleted? And their answer sounds like they dropped the time from more than an hour to just 10 seconds with different code but presumably on the same disk, so how were they "limited by the disk's bandwidth"? – Kelly Bundy Aug 12 '21 at 11:09
  • Nothing in the code they posted here should take that long. If `Faker` is that inefficient, that might be worth investigating; but that's unreproducible without more details. The calculation 500k times average line (record) length should probably translate to a few dozen megabytes, which is in and of itself not too much by today's standards. – tripleee Aug 12 '21 at 11:18
  • 1) You could run a profiler and see what takes time. 2) You might have a bottleneck since you are constructing a class and formatting the string before writing to the file. 3) Is it saving the file on each line? 4) You can also test how much time it would take if you just write to a normal file instead of using the csv library, `with open("fake_file.txt", "w") as fake_file:` ... one library might be faster than another. Last year I had something similar; I tried to make it faster in Go and ended up being 4 times slower. – pelos Aug 12 '21 at 18:33
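
The comments above boil down to two checks: profile before optimizing, and see whether constructing `Faker()` on every iteration (rather than once) is what costs the time. A minimal sketch of both using the standard library's cProfile; the `write_rows` function name and the smaller 50,000-row trial size are illustrative choices, not anything from the question:

    import cProfile
    import csv
    import pstats
    from uuid import uuid4

    from faker import Faker  # assumes the faker package is installed


    def write_rows(path, count):
        """Write `count` rows of fake data to `path` as CSV."""
        fake = Faker()  # created once here, not once per row
        with open(path, 'w', newline='') as csvf:
            writer = csv.writer(csvf)
            writer.writerow(("email", "customerId", "firstName", "lastName"))
            for _ in range(count):
                writer.writerow((f'qa+{uuid4()}@example.com', uuid4(),
                                 fake.first_name(), fake.last_name()))


    # Profile a smaller run first to see where the time actually goes.
    cProfile.run("write_rows('test_csv_file.csv', 50000)", "profile.out")
    pstats.Stats("profile.out").sort_stats("cumulative").print_stats(10)

The top of the cumulative listing should make it clear whether the Faker calls, the UUID formatting, or the actual disk writes dominate.
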

1 Answer

  • So I was able to generate a 500k record CSV file in ~10 seconds by using the Mimesis library.

  • It might not be the cleanest and best approach, but it solved my problem [open to more feedback].

Updated code:


    import csv

    from mimesis import Person, Numbers
    from mimesis.builtins import USASpecProvider

    person = Person()
    number = Numbers()  # instantiated but not used below
    us = USASpecProvider()
    column_headers = ["email", "customerId", "firstName", "lastName"]

    with open('test_csv_file.csv', 'w') as csvf:
        writer = csv.writer(csvf)
        writer.writerow(column_headers)
        for _ in range(500000):
            row = [person.email(unique=True), us.tracking_number(), person.first_name(), person.last_name()]
            writer.writerow(row)
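
If you want to double-check the ~10 second figure on your own machine, a simple wall-clock measurement around the write loop is usually enough. A minimal sketch built on the code above; the `generate` function name and the `time.perf_counter` timing are additions for illustration, not part of the original answer:

    import csv
    import time

    from mimesis import Person
    from mimesis.builtins import USASpecProvider


    def generate(path, count):
        """Write `count` fake rows to `path` and return the elapsed seconds."""
        person = Person()
        us = USASpecProvider()
        start = time.perf_counter()
        with open(path, 'w', newline='') as csvf:
            writer = csv.writer(csvf)
            writer.writerow(["email", "customerId", "firstName", "lastName"])
            for _ in range(count):
                writer.writerow([person.email(unique=True), us.tracking_number(),
                                 person.first_name(), person.last_name()])
        return time.perf_counter() - start


    print(f"Wrote 500000 rows in {generate('test_csv_file.csv', 500000):.1f} s")
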

philomath