
MySQL 5.7.18
Python 2.7.5
Pandas 0.17.1
CentOS 7.3

A MySQL table:

CREATE TABLE test (
  id varchar(12)
) ENGINE=InnoDB;

The size is 10GB.

select round(((data_length) / 1024 / 1024 / 1024)) "GB"
from information_schema.tables 
where table_name = "test"

10GB

The box has 250GB memory:

$ free -hm
              total        used        free      shared  buff/cache   available
Mem:           251G         15G        214G        2.3G         21G        232G
Swap:          2.0G        1.2G        839M

Select the data:

import psutil
print '1 ' + str(psutil.phymem_usage())

import os
import sys
import time
import pyodbc 
import mysql.connector
import pandas as pd
from datetime import date
import gc
print '2 ' + str(psutil.phymem_usage())

db = mysql.connector.connect({snip})
c = db.cursor()
print '3 ' + str(psutil.phymem_usage())

c.execute("select id from test")
print '4 ' + str(psutil.phymem_usage())

e=c.fetchall()
print 'getsizeof: ' + str(sys.getsizeof(e))
print '5 ' + str(psutil.phymem_usage())

d=pd.DataFrame(e)
print d.info()
print '6 ' + str(psutil.phymem_usage())

c.close()
print '7 ' + str(psutil.phymem_usage())

db.close()
print '8 ' + str(psutil.phymem_usage())

del c, db, e
print '9 ' + str(psutil.phymem_usage())

gc.collect()
print '10 ' + str(psutil.phymem_usage())

time.sleep(60)
print '11 ' + str(psutil.phymem_usage())

The output:

1 svmem(total=270194331648L, available=249765777408L, percent=7.6, used=39435464704L, free=230758866944L, active=20528222208, inactive=13648789504, buffers=345387008L, cached=18661523456)
2 svmem(total=270194331648L, available=249729019904L, percent=7.6, used=39472222208L, free=230722109440L, active=20563484672, inactive=13648793600, buffers=345387008L, cached=18661523456)
3 svmem(total=270194331648L, available=249729019904L, percent=7.6, used=39472222208L, free=230722109440L, active=20563484672, inactive=13648793600, buffers=345387008L, cached=18661523456)
4 svmem(total=270194331648L, available=249729019904L, percent=7.6, used=39472222208L, free=230722109440L, active=20563484672, inactive=13648793600, buffers=345387008L, cached=18661523456)
getsizeof: 1960771816
5 svmem(total=270194331648L, available=181568315392L, percent=32.8, used=107641655296L, free=162552676352L, active=88588271616, inactive=13656334336, buffers=345395200L, cached=18670243840)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 231246823 entries, 0 to 231246822
Data columns (total 1 columns):
0    object
dtypes: object(1)
memory usage: 3.4+ GB
None
6 svmem(total=270194331648L, available=181571620864L, percent=32.8, used=107638353920L, free=162555977728L, active=88587603968, inactive=13656334336, buffers=345395200L, cached=18670247936)
7 svmem(total=270194331648L, available=181571620864L, percent=32.8, used=107638353920L, free=162555977728L, active=88587603968, inactive=13656334336, buffers=345395200L, cached=18670247936)
8 svmem(total=270194331648L, available=181571620864L, percent=32.8, used=107638353920L, free=162555977728L, active=88587603968, inactive=13656334336, buffers=345395200L, cached=18670247936)
9 svmem(total=270194331648L, available=183428308992L, percent=32.1, used=105781678080L, free=164412653568L, active=86735921152, inactive=13656334336, buffers=345395200L, cached=18670260224)
10 svmem(total=270194331648L, available=183428308992L, percent=32.1, used=105781678080L, free=164412653568L, active=86735921152, inactive=13656334336, buffers=345395200L, cached=18670260224)
11 svmem(total=270194331648L, available=183427203072L, percent=32.1, used=105782812672L, free=164411518976L, active=86736560128, inactive=13656330240, buffers=345395200L, cached=18670288896)

I even deleted the database connection and called garbage collection.

How could a 10GB table use up 60GB of my memory?

smci
davidjhp
  • What happens if you separate `DataFrame` creation and `c.fetchall()` and print memory usage in between? – Andrey Portnoy Sep 02 '18 at 00:28
  • I separated the two items as you stated, and replaced the code and output in the above question. It seems all the memory is lost immediately at fetchall(). – davidjhp Sep 02 '18 at 02:02
  • Similar: [Python hangs on fetchall using MySQL connector](https://stackoverflow.com/questions/31486547/python-hangs-on-fetchall-using-mysql-connector) – smci Sep 02 '18 at 08:09
  • Your fetchall is returning 231 million rows, seems like a bad idea. You can set the fetch size to something more sane. Also, does MySQL connector have a context handler? – smci Sep 02 '18 at 08:29
  • Can you do a `getsizeof` on an element of the `e` list of tuples? And also on each element of the tuple? – FelixEnescu Sep 02 '18 at 10:50

2 Answers


The short answer: Python data structure memory overhead.

You have a table with ~231M rows taking ~10GB on disk, so each row averages only a few dozen bytes of storage, and the id value itself is just a few characters (the examples below assume something like 'abcd').

fetchall translates that into a list of tuples like this:

[('abcd',), ('1234',), ... ]

Your list has ~231M elements; sys.getsizeof reports ~1.96GB for it, about 8.48 bytes per element, but that counts only the list's pointer slots. Every row also owns its own tuple and string objects, which is how the whole structure grows to roughly 19GB and beyond.

$ python
Python 2.7.12 (default, Nov 19 2016, 06:48:10)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys

A tuple:

>>> a = ('abcd',)
>>> sys.getsizeof(a)
64

A list of one tuple:

>>> al = [('abcd',)]
>>> sys.getsizeof(al)
80

A list of two tuples:

>>> al2 = [('abcd',), ('1234',)]
>>> sys.getsizeof(al2)
88

A list with 10 tuples:

>>> al10 = [ ('abcd',) for x in range(10)]
>>> sys.getsizeof(al10)
200

A list with 1M tuples:

>>> a_really_long = [ ('abcd',) for x in range(1000000)]
>>> sys.getsizeof(a_really_long)
8697472

Almost our number: about 8.7 bytes of list overhead per tuple, close to the 8.48 bytes per element measured above.
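
Note that getsizeof on the list counts only its array of pointers; in a real fetchall result every row also owns its own tuple object and its own string. A rough per-row accounting (a sketch for 64-bit CPython 2.7; exact sizes vary by build):

import sys

row = ('abcd',)
per_row = 8                        # one pointer slot in the list (ignoring over-allocation)
per_row += sys.getsizeof(row)      # the 1-tuple itself, 64 bytes here
per_row += sys.getsizeof(row[0])   # the str object, ~41 bytes for 4 characters on Python 2
print per_row                      # on the order of 110 bytes per row

Multiplied by ~231M rows that is already tens of gigabytes, before counting anything mysql.connector buffers internally while fetchall runs.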

Unfortunately there isn't much you can do here: mysql.connector chooses the data structure, and a dictionary cursor would use even more memory.

If you need to reduce memory usage, use fetchmany with a suitable size argument so that only one batch of rows is alive at a time.
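
A minimal sketch of that approach (untested against the question's server; the connection arguments and the 1,000,000 batch size are placeholders):

import mysql.connector
import pandas as pd

# placeholder connection arguments, standing in for the question's {snip}
db = mysql.connector.connect(host='...', user='...', password='...', database='...')
c = db.cursor()
c.execute("select id from test")

frames = []
while True:
    rows = c.fetchmany(size=1000000)   # at most ~1M tuples alive at once
    if not rows:
        break
    frames.append(pd.DataFrame(rows, columns=['id']))
    del rows                           # let each batch be freed promptly

c.close()
db.close()
d = pd.concat(frames, ignore_index=True)

The final DataFrame still stores the ids as Python string objects, so the end state is similar to the question's, but the transient 231M-tuple list never exists.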

FelixEnescu
  • Your answer states that a table of 10GB turns into 19GB in memory. That sounds like an increase of 9GB. How does that explain why 60GB disappears? – davidjhp Sep 02 '18 at 15:53
  • Right - it may be internal usage of `mysql.connector`; could you please try `fetchmany` with a page size of 100M? – FelixEnescu Sep 03 '18 at 08:43
  • I rewrote the script to use fetchmany() with a page size of 100M. The result was that memory used decreased to 23%, which means about 40GB used. This is an improvement, but it is still a mystery why a 10GB table uses 40GB of memory. – davidjhp Sep 07 '18 at 15:45

Edit: pd.read_sql officially supports SQLAlchemy connectables (plus raw sqlite3 connections). Start by using create_engine from SQLAlchemy to connect to your database:

from sqlalchemy import create_engine
engine = create_engine('mysql://database')

then call .connect() on the resulting object:

connection = engine.connect()

Pass that connection to pd.read_sql:

df = pd.read_sql("select id from test", connection)

This should decrease your memory footprint.
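
If that is not enough, pd.read_sql can also stream the result in chunks via its chunksize argument; a sketch (the connection URL and chunk size are placeholders):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mysql://user:password@host/dbname')   # placeholder URL

frames = []
for chunk in pd.read_sql("select id from test", engine, chunksize=1000000):
    frames.append(chunk)   # or aggregate each chunk here and drop it

df = pd.concat(frames, ignore_index=True)

Whether this also bounds memory on the driver side depends on the driver's cursor handling; some MySQL drivers buffer the entire result client-side unless a server-side (streaming) cursor is used.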

Would you mind posting the memory usage results after you try the above?

Andrey Portnoy
  • I rewrote the script to use pd.read_sql() as you stated, but the memory usage results were the same. By the way I got pd.read_sql() to work without using SQLAlchemy. – davidjhp Sep 02 '18 at 03:25
  • In the original script the problem starts at "e=c.fetchall()", which is before I get into Pandas. So the problem is not Pandas, it is before that. – davidjhp Sep 02 '18 at 03:28
  • @davidjhp Right. I don't have any further advice for you as I'm unfamiliar with the implementation outside of Pandas. Maybe this answer (and question) will lead you somewhere: https://stackoverflow.com/a/17861221 – Andrey Portnoy Sep 02 '18 at 03:35