15

I want to write random sample data to a CSV file until it is 1 GB big. The following code works:

import numpy as np
import uuid
import csv
import os
outfile = 'data.csv'
outsize = 1024 # MB
with open(outfile, 'ab') as csvfile:
    wtr = csv.writer(csvfile)
    while (os.path.getsize(outfile)//1024**2) < outsize:
        wtr.writerow(['%s,%.6f,%.6f,%i' % (uuid.uuid4(), np.random.random()*50, np.random.random()*50, np.random.randint(1000))])    

How to get it faster?

Stephen
Balzer82
  • Why do you tag this question with numpy but don't use it (it isn't needed for random numbers)? Why create a csv writer but write only one string per line? It is not guaranteed that the reported size of a file is updated while the file is still open. Calculate the size on your own instead of using `getsize`; that is much faster, too. – Daniel Jan 01 '15 at 14:13

3 Answers

15

The problem appears to be mainly I/O-bound. You can improve the I/O a bit by writing to the file in larger chunks instead of writing one line at a time:

import numpy as np
import uuid
import os
outfile = 'data-alt.csv'
outsize = 10 # MB
chunksize = 1000
with open(outfile, 'a') as csvfile:
    while (os.path.getsize(outfile)//1024**2) < outsize:
        data = [[uuid.uuid4() for i in range(chunksize)],
                np.random.random(chunksize)*50,
                np.random.random(chunksize)*50,
                np.random.randint(1000, size=(chunksize,))]
        csvfile.writelines(['%s,%.6f,%.6f,%i\n' % row for row in zip(*data)])   

You can experiment with the chunksize (the number of rows written per chunk) to see what works best on your machine.
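
If it helps, here is a minimal sketch of how such an experiment might look (the write_in_chunks helper and the specific chunk sizes are just illustrative, not part of the code above):

import os
import time
import uuid
import numpy as np

def write_in_chunks(path, outsize_mb, chunksize):
    # Hypothetical helper wrapping the loop above so chunksize can be varied.
    with open(path, 'w') as csvfile:
        while (os.path.getsize(path) // 1024**2) < outsize_mb:
            data = [[uuid.uuid4() for i in range(chunksize)],
                    np.random.random(chunksize)*50,
                    np.random.random(chunksize)*50,
                    np.random.randint(1000, size=(chunksize,))]
            csvfile.writelines(['%s,%.6f,%.6f,%i\n' % row for row in zip(*data)])

for chunksize in (100, 1000, 10000, 100000):
    t0 = time.time()
    write_in_chunks('data-chunktest.csv', 10, chunksize)
    print(chunksize, 'rows per chunk:', round(time.time() - t0, 2), 's')

The sweet spot is usually where the chunk is large enough to amortize the per-call Python overhead but still fits comfortably in memory.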


Here is a benchmark, comparing the above code to your original code, with outsize set to 10 MB:

% time original.py

real    0m5.379s
user    0m4.839s
sys 0m0.538s

% time write_in_chunks.py

real    0m4.205s
user    0m3.850s
sys 0m0.351s

So this is about 25% faster than the original code.


PS. I tried replacing the calls to os.path.getsize with an estimate of the total number of lines needed. Unfortunately, it did not improve the speed. Since the number of bytes needed to represent the final int varies, the estimate is also inexact -- that is, it does not perfectly replicate the behavior of your original code. So I left os.path.getsize in place.
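
For completeness, here is a rough sketch of the other route hinted at in the comments: keeping a running byte count in the loop instead of calling os.path.getsize. As noted above, this stops at the target size slightly differently than the original code, and the file name and variable names are just illustrative:

import uuid
import numpy as np

outfile = 'data-tracked.csv'
outsize = 10 * 1024**2   # target size in bytes
chunksize = 1000

written = 0
with open(outfile, 'w') as csvfile:
    while written < outsize:
        data = [[uuid.uuid4() for i in range(chunksize)],
                np.random.random(chunksize)*50,
                np.random.random(chunksize)*50,
                np.random.randint(1000, size=(chunksize,))]
        lines = ['%s,%.6f,%.6f,%i\n' % row for row in zip(*data)]
        csvfile.writelines(lines)
        # keep a running total instead of stat-ing the file on every pass
        written += sum(len(line) for line in lines)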

unutbu
8

Removing all the unnecessary overhead makes the code faster and easier to understand:

import random
import uuid
outfile = 'data.csv'
outsize = 1024 * 1024 * 1024 # 1GB
with open(outfile, 'a') as csvfile:
    size = 0
    while size < outsize:
        txt = '%s,%.6f,%.6f,%i\n' % (uuid.uuid4(), random.random()*50, random.random()*50, random.randrange(1000))
        size += len(txt)
        csvfile.write(txt)
Daniel
  • Is len(txt) == filesize? And `random.randint(1000)` takes 2 arguments. – Balzer82 Jan 01 '15 at 14:53
  • randint -> randrange. And `len(txt)` is the length of one line. – Daniel Jan 01 '15 at 14:56
  • OK. But the length of one line, or the sum of the line lengths, is not the file size. BTW, your code is not faster. Try it out. – Balzer82 Jan 01 '15 at 14:57
  • 2
    @Balzer82, the fastest way to write is probably buying a SSD :). Optimizing code where the bottleneck is in the IO is rather difficult. There is a lot of low-level buffering and optimization happening, which we cannot see. Don't be too surprised that a code which should run faster, is actually not significantly faster. – cel Jan 01 '15 at 15:36
0

This is an update building on unutbu's answer above:

A large percentage of the time is spent generating the random numbers and checking the file size.

If you generate the rows ahead of time, you can assess the raw disk I/O performance:

import time
from pathlib import Path
import numpy as np
import uuid
outfile = Path('data-alt.csv')
chunksize = 1_800_000

data = [
    [uuid.uuid4() for i in range(chunksize)],
    np.random.random(chunksize) * 50,
    np.random.random(chunksize) * 50,
    np.random.randint(1000, size=(chunksize,))
]
rows = ['%s,%.6f,%.6f,%i\n' % row for row in zip(*data)]

t0 = time.time()
with open(outfile, 'a') as csvfile:
    csvfile.writelines(rows)
tdelta = time.time() - t0
print(tdelta)

On my standard 860 EVO SSD (not NVMe), I get 1.43 s for 1_800_000 rows, which is about 1,258,741 rows/sec (not too shabby IMO).
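
As a rough follow-up sketch (same names as the snippet above; timings will obviously vary by machine), you could time the generation and the write separately to see how the total splits between the two:

import time
import uuid
import numpy as np

chunksize = 1_800_000

# time the row generation on its own
t0 = time.time()
data = [
    [uuid.uuid4() for i in range(chunksize)],
    np.random.random(chunksize) * 50,
    np.random.random(chunksize) * 50,
    np.random.randint(1000, size=(chunksize,))
]
rows = ['%s,%.6f,%.6f,%i\n' % row for row in zip(*data)]
print('generate:', round(time.time() - t0, 2), 's')

# then time the raw write
t0 = time.time()
with open('data-alt.csv', 'a') as csvfile:
    csvfile.writelines(rows)
print('write:', round(time.time() - t0, 2), 's')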

AustEcon