
I'm trying to extract a huge amount of data from a DB and write it to a CSV file, and I want to find the fastest way to do this. I found that running writerows on the result of a fetchall was 40% slower than the code below.

import csv

with open(filename, 'a') as f:
    writer = csv.writer(f, delimiter='\t')
    cursor.execute("SELECT * FROM table")
    writer.writerow([i[0] for i in cursor.description])

    count = 0
    builder = []
    row = cursor.fetchone()
    DELIMITERS = ['\t'] * (len(row) - 1) + ['\n']
    while row:
        count += 1
        # Add row with delimiters to builder 
        builder += [str(item) for pair in zip(row, DELIMITERS) for item in pair]
        if count == 1000:
            count = 0
            f.write(''.join(builder))
            builder[:] = []
        row = cursor.fetchone()
    f.write(''.join(builder))

Edit: The database I'm using is unique to the small company that I'm working for, so unfortunately I can't provide much information on that front. I'm using jpype to connect with the database, since the only means of connecting is via a JDBC driver. I'm running CPython 2.7.5; I'd love to use PyPy, but it doesn't work with Pandas.

Since I'm extracting such a large number of rows, I'm hesitant to use fetchall for fear that I'll run out of memory. The row version has comparable performance and is much easier on the eyes, so I think I'll use that. Thanks a bunch!
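If fetchall's memory use does become a problem, a middle ground I may try is fetchmany, which should keep the speed of writerows while only holding one chunk of rows in memory at a time. This is just an untested sketch (the chunk size of 1000 is arbitrary, and filename/cursor are the same names as in the code above):

import csv

def rows_chunked(chunk_size=1000):
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor.execute("SELECT * FROM table")
        writer.writerow([i[0] for i in cursor.description])
        while True:
            # fetchmany holds at most chunk_size rows in memory at once
            chunk = cursor.fetchmany(chunk_size)
            if not chunk:
                break
            writer.writerows(chunk)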

Luke
  • Which database? Which interface library? Can you give us a table schema and some data? – abarnert Jan 21 '14 at 22:16
  • For example: With some databases, `fetchone` fetches one row at a time (or, even better, one buffer full of rows at a time) from the server, which makes it a lot more efficient than `fetchall`, which obviously fetches all of them, meaning you can't do any work until they're all ready, and you also probably waste a lot of work allocating and managing the memory for them. But in others, `fetchone` does (in effect) a `fetchall` behind the scenes, so there's no advantage, and in fact a small slowdown. – abarnert Jan 21 '14 at 22:18
  • Also, your code will generate a broken CSV file if there are any tabs, newlines, or other special characters in any of the data. Is that acceptable? (See the small demonstration after these comments.) – abarnert Jan 21 '14 at 22:19
  • It is critical to know which DB system you are working with to answer this question. In general it is probably better to create the table from the DB side so you don't have to worry about the overhead of bringing it into Python. I would guess fastest would be using COPY like in http://stackoverflow.com/questions/1517635/save-postgres-sql-output-to-csv-file – GrantD71 Jan 21 '14 at 22:32
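To make that point about special characters concrete: csv.writer quotes any field that contains the delimiter or a newline, while the hand-rolled ''.join() formatting writes them through unescaped. A quick demonstration (Python 2, writing to an in-memory buffer):

import csv
from io import BytesIO  # Python 2's csv module writes byte strings

buf = BytesIO()
writer = csv.writer(buf, delimiter='\t')
writer.writerow(['plain', 'has\ttab', 'has\nnewline'])
print(repr(buf.getvalue()))
# Expected output is something like: 'plain\t"has\ttab"\t"has\nnewline"\r\n'
# The awkward fields are quoted, so a CSV reader can recover them; the manual
# str(item) + '\t' approach would silently produce a corrupted row instead.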

1 Answer


With the little you've given us to go on, it's hard to be more specific, but…

I've wrapped your code up as a function (called manual below), and written three alternative versions:

import csv

def row():
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        for row in cursor:
            writer.writerow(row)

def rows():
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        writer.writerows(cursor)

def rowsall():
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        writer.writerows(cursor.fetchall())

Notice that the last one is the one you say you tried.

Now, I wrote this test driver:

import random
import sqlite3
import string
import timeit

def randomname():
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(30))

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE mytable (id INTEGER PRIMARY KEY AUTOINCREMENT, name VARCHAR)')
db.executemany('INSERT INTO mytable (name) VALUES (?)',
               [[randomname()] for _ in range(10000)])

filename = 'db.csv'

for f in manual, row, rows, rowsall:
    t = timeit.timeit(f, number=1)
    print('{:<10} {}'.format(f.__name__, t))

And here are the results:

manual     0.055549702141433954
row        0.03852885402739048
rows       0.03992213006131351
rowsall    0.02850699401460588

So, your code takes nearly twice as long as calling fetchall and writerows in my test!

When I repeat a similar test with other databases, however, rowsall is anywhere from 20% faster to 15% slower than manual (never 40% slower, but as much as 15%)… but row or rows is always significantly faster than manual.

I think the explanation is that your custom code is significantly slower than csv.writerows, but that in some databases, using fetchall instead of fetchone (or just iterating the cursor) slows things down significantly. The reason this isn't true with an in-memory sqlite3 database is that fetchone does all of the same work as fetchall and then feeds you the rows one at a time; with a remote database, fetchone may do anything from fetching all the rows, to fetching a buffer at a time, to fetching a single row at a time, making it potentially much slower or faster than fetchall, depending on your data.

But for a really useful explanation, you'd have to tell us exactly which database and library you're using, which Python version (CPython 3.3.2's csv module seems to be a lot faster than CPython 2.7.5's, and PyPy 2.1/2.7.2 seems to be faster than CPython 2.7.5 as well, but then either one might run your code faster too…), and so on.

abarnert