1

I have the following script to fill a test database for me.

#!/usr/bin/python

import random, sys, sqlite3

con = sqlite3.connect('test.db')

cur = con.cursor()
cur.execute("DROP TABLE IF EXISTS EXAMPLE")
cur.execute("CREATE TABLE EXAMPLE(FIRST TEXT, SECOND TEXT)")

for i in range(0, 99999999):
    one = format(i, '08d')
    two = "%0.8d" % random.randint(0,99999999)
    cur.execute("INSERT INTO EXAMPLE VALUES(\'"+one+"\',\'"+two+"\')")

    con.commit()

    # to have some feedback of the progress
    if i % 100000 == 0:
        print (str(i))

con.close()
print ("done")
# wait in the script..
sys.stdin.readline()

Now the script takes up all the RAM it can (runs currently in a VM with 3GB - takes up ~2.8GB) within about 2 minutes (edit: within a few seconds) and never reached the if i % 100000 == 0:. If I terminate it and check the test.db file it's 3KB big and contains the table only, no entries.

Do I need to close and re-open the connection every now and then?

Daedalus Mythos
  • 565
  • 2
  • 8
  • 24
  • `for i in range(0, 99999999)` allocates the whole list up front. Try replacing `range` with `irange` as a start. It's implemented as a generator instead and will build a stream instead of a list (conceptually). A list of 100mio entries is a large list :) – Morten Jensen Sep 24 '14 at 12:12
  • thanks for the info, I've never used `irange`.. how do I use it? Couldn't find it in the python doc.. – Daedalus Mythos Sep 24 '14 at 12:24
  • Python 2 or 3? If 3, then `range` is fine and doesn't allocate the entire list. If not, use xrange, not irange. – Colonel Thirty Two Sep 24 '14 at 12:32
  • Also, move `con.commit` to after the loop. Insertions are fast, but commits are really slow, so you want to batch as much work as you can in one commit. – Colonel Thirty Two Sep 24 '14 at 12:36
  • I could use either - what would you suggest Python 3 with `range` or Python 2 with `xrange`? – Daedalus Mythos Sep 24 '14 at 12:36

1 Answers1

1

As my comment suggested, you should consider using irange instead of range when returning lists with many elements in them, especially if you are only going to use the list once.

The difference lies in the implementation. range creates the whole list up front, and returns elements from the list. irange just creates and returns the next element from the list each time, and so it doesn't do as much work up front.

The function calls are interchangeable to my knowledge, so just replace range in your code with irange and you should see less RAM consumption.

Check this question out also: How is irange() any different from range() or xrange()?

EDIT:

Sorry, I was answering a bit fast. Use xrange and not irange as I previously wrote. For when to use one over the other: Should you always favor xrange() over range()?

The reason it doesn't matter if you use range or xrange in Python 3 is, AFAIK because range is implemented as xrange.

Community
  • 1
  • 1
Morten Jensen
  • 5,818
  • 3
  • 43
  • 55
  • I tried to do so - `for i in irange(0, 99999999):` instead of `for i in range(0, 99999999):`, but i get `NameError: name 'irange' is not defined` – Daedalus Mythos Sep 24 '14 at 12:31
  • @DaedalusMythos does a change from `range` to `xrange` make any difference on the RAM consumption :) ? – Morten Jensen Sep 25 '14 at 07:51
  • yes. absolutely! the RAM consumption is now very low and works as intended! – Daedalus Mythos Sep 25 '14 at 09:40
  • @DaedalusMythos That's great! :) I often try to implement list-returning functions that I iterate over, as generators. Whenever something is used like a stream, I prefer a generator. Thanks for accepting my answer :) – Morten Jensen Sep 25 '14 at 10:01