
I'm new to MySQL, and I have a question about the memory.

I have a 200 MB table (MyISAM, 2,000,000 rows), and I'm trying to load all of it into memory.

I use Python (actually MySQLdb in Python) with the SQL: SELECT * FROM table.
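Roughly, the fetch looks like this (a minimal sketch; the connection parameters and table name are placeholders, not my real ones):

import MySQLdb

# Placeholder connection parameters and table name.
conn = MySQLdb.connect(host="localhost", user="user", passwd="pass", db="mydb")
cur = conn.cursor()
cur.execute("SELECT * FROM my_table")  # the whole 2,000,000-row table
rows = cur.fetchall()                  # memory climbs while this runs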

However, from Linux "top" I see this Python process using 50% of my memory (which is 6 GB in total).

I'm curious why it uses about 3 GB of memory for only a 200 MB table. Thanks in advance!

Huang Yen-Chieh
  • No production database bindings should be using memory proportional to the size of the request; as long as you don't keep the results around yourself, the old data should be released as you receive more data. Not all database bindings do this properly, but that's always a bug; you should be able to select an infinite range from the database backend, process the results incrementally, and memory usage should not grow without bound. – Glenn Maynard Dec 30 '10 at 02:11
  • It's almost _always_ a bad idea to use `select *`, some DB tools notwithstanding. You should generally explicitly list the columns you want. – paxdiablo May 26 '14 at 04:39

4 Answers


There's nothing inherently wrong with what you're doing. If memory usage is expanding with the size of the query, then one of a few things is happening (a short sketch after this list illustrates both cases):

  • You're leaking references to the results you're receiving; for example, putting them in a list somewhere. I suspect you'd know if you were doing that.
  • The database bindings or underlying library isn't freeing memory from previous rows while reading new rows from a query. This is usually a bug. It may happen normally if you have debugging features on, but it should not happen by default.
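As a quick illustration of both cases, here's a hypothetical sketch; it uses SQLite so it's self-contained, with a small in-memory table standing in for your MySQL table:

import sqlite3

# Hypothetical sketch: a small in-memory table stands in for the real one.
conn = sqlite3.connect(":memory:")
conn.execute("create table big_table (i integer)")
conn.executemany("insert into big_table values (?)", ((i,) for i in xrange(100000)))

# Case 1: leaking references -- every row stays alive in the list, so
# memory grows with the size of the result set.
rows = [row for row in conn.execute("select * from big_table")]

# Case 2: processing incrementally -- each row becomes garbage on the next
# iteration, so memory stays flat (assuming the bindings behave correctly).
for row in conn.execute("select * from big_table"):
    pass  # stand-in for whatever per-row work you do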

Note that the underlying library may cache a certain amount of data, and you may see a significant amount of memory use as a result, but unless something is catastrophically wrong with the configuration it shouldn't be 3GB.

Here's some simple SQLite code that reproduces what you're doing. When run, it creates a trivial table of fifteen million rows, which is about 180 MB on disk for the version I'm using. It then SELECTs all of this data, throwing away the results, and sleeps so you can examine the result. The resulting process, on my system, uses only 15 MB.

(Note that I ran the create_db and read_db passes with separate invocations; creating the database takes some time.)

SQLite can handle this, and any production server-backed database like MySQL and PostgreSQL should be able to, too. SELECT results are a data stream, and databases should be able to handle streams of unlimited size easily.

import sqlite3
import time
def create_db(conn):
    c = conn.cursor()
    c.execute('create table test (i integer)')
    conn.commit()
    max_val = 15000000
    chunk = 1000000
    for start in xrange(0, max_val, chunk):
        print "%i ..." % start
        for i in xrange(start, start + chunk):
            c = conn.cursor()
            c.execute('insert into test (i) values (?)', (i,))
        conn.commit()

def read_db(conn):
    c = conn.cursor()
    c.execute('select * from test')
    for x in xrange(15000000):
        c.fetchone()

    print "Done"

    # Sleep forever, to examine memory usage:
    while True:
        time.sleep(1)

def go():
    conn = sqlite3.connect('test.db')

    # Pick one:
    create_db(conn)
    # read_db(conn)

if __name__ == "__main__":
    go()

This doesn't answer your question, but I wanted to make it clear that there's nothing wrong with what you're doing: you should not need to chunk queries manually, though in the end that's probably the workaround you'll need.
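If you do end up chunking manually, it would look roughly like this (a hypothetical sketch; the connection parameters and table name are placeholders, and it assumes an integer primary key named id that is the first selected column):

import MySQLdb

# Hypothetical sketch; placeholders throughout.  Assumes an integer
# primary key `id` to range over, and that `id` is the first column.
conn = MySQLdb.connect(host="localhost", user="user", passwd="pass", db="mydb")
chunk = 100000
last_id = 0
while True:
    c = conn.cursor()
    c.execute("SELECT * FROM old_table WHERE id > %s ORDER BY id LIMIT %s",
              (last_id, chunk))
    rows = c.fetchall()
    if not rows:
        break
    for row in rows:
        pass                   # stand-in for your per-row processing
    last_id = rows[-1][0]      # advance past the last id in this chunk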

Glenn Maynard

In pretty much any scripting language, a variable will always take up more memory than its actual contents would suggest. An INT might be 32 or 64 bits, suggesting it would require 4 or 8 bytes of memory, but it will take up 16 or 32 bytes (pulling numbers out of my hat), because the language interpreter has to attach various metadata to that value along the way.

The database might only require 200 megabytes of raw storage space, but once you factor in the metadata, it will definitely occupy much, much more.
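You can see that overhead directly from the interpreter; the sizes below are rough 64-bit CPython figures and vary by version and build:

import sys

# Rough per-object sizes; exact numbers depend on the Python build.
print sys.getsizeof(1)              # a small int: ~24 bytes, not 4 or 8
print sys.getsizeof("hello")        # a 5-character string: ~40 bytes
print sys.getsizeof((1, "hello"))   # and the tuple holding each row adds more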

Marc B

This is a combination of Marc B's answer and a gotcha (not a bug) in MySQLdb. The default cursor in MySQLdb is a client-side cursor, meaning the client library marshals the entire result set in memory within the client process. See the answer to

How to get a row-by-row MySQL ResultSet in python

for a solution to this problem (essentially, using a server-side cursor class). The more columns you return in your select query, the bigger the in-memory expansion you can expect, because each column causes additional interpreter metadata to be created. 3 GB seems huge, though, unless you have dozens of columns.
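Concretely, switching to the server-side cursor class looks roughly like this (a minimal sketch; the connection parameters and table name are placeholders):

import MySQLdb
import MySQLdb.cursors

# Placeholder connection parameters; SSCursor streams rows from the server
# instead of marshalling the whole result set into client memory.
conn = MySQLdb.connect(host="localhost", user="user", passwd="pass", db="mydb",
                       cursorclass=MySQLdb.cursors.SSCursor)
c = conn.cursor()
c.execute("SELECT * FROM old_table")
for row in iter(c.fetchone, None):
    pass  # stand-in for your per-row processing

Keep in mind that with a server-side cursor you need to consume (or close) the whole result set before issuing another query on the same connection.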

sirlark

This is almost certainly a bad design.

What are you doing with all that data in memory at once?

If it's for one user, why not pare the size down so you can support multiple users?

If you're doing a calculation on the middle tier, is it possible to shift the work to the database server so you don't have to bring all the data into memory?

You know you can do this, but the larger questions are (1) why? and (2) what else could you do? We'd need more context to answer these.
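For example, if the per-row work can be expressed in SQL, something along these lines (hypothetical table and column names) keeps all the data on the server:

import MySQLdb

# Hypothetical sketch; table and column names are placeholders.  The
# transformation runs inside MySQL, so nothing is pulled into Python.
conn = MySQLdb.connect(host="localhost", user="user", passwd="pass", db="mydb")
c = conn.cursor()
c.execute("""
    INSERT INTO new_table (id, result)
    SELECT id, some_column * 2    -- stand-in for the real calculation
    FROM old_table
""")
conn.commit()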

duffymo
  • Thanks for the reply. I think I know what I should do next. Actually I'm processing some data from the old_table and inserting the result into new_table row by row. So maybe I should split the data into smaller sets and process them one by one. Thanks! – Huang Yen-Chieh Dec 30 '10 at 02:04
  • I'd suggest that using a stored procedure and doing the work on the database might make more sense in that case. – duffymo Dec 30 '10 at 02:22