
MySQL 5.7.18
Python 2.7.5
Pandas 0.17.1
CentOS 7.3

A MySQL table:

CREATE TABLE test (
  id varchar(12)
) ENGINE=InnoDB;

The size is 10GB.

select round(((data_length) / 1024 / 1024 / 1024)) "GB"
from information_schema.tables 
where table_name = "test"

10GB

The box has 250GB memory:

$ free -hm
              total        used        free      shared  buff/cache   available
Mem:           251G         15G        214G        2.3G         21G        232G
Swap:          2.0G        1.2G        839M

Select the data:

import psutil
print '1 ' + str(psutil.phymem_usage())

import os
import sys
import time
import pyodbc 
import mysql.connector
import pandas as pd
from datetime import date
import gc
print '2 ' + str(psutil.phymem_usage())

db = mysql.connector.connect({snip})
c = db.cursor()
print '3 ' + str(psutil.phymem_usage())

c.execute("select id from test")
print '4 ' + str(psutil.phymem_usage())

e=c.fetchall()
print 'getsizeof: ' + str(sys.getsizeof(e))
print '5 ' + str(psutil.phymem_usage())

d=pd.DataFrame(e)
print d.info()
print '6 ' + str(psutil.phymem_usage())

c.close()
print '7 ' + str(psutil.phymem_usage())

db.close()
print '8 ' + str(psutil.phymem_usage())

del c, db, e
print '9 ' + str(psutil.phymem_usage())

gc.collect()
print '10 ' + str(psutil.phymem_usage())

time.sleep(60)
print '11 ' + str(psutil.phymem_usage())

The output:

1 svmem(total=270194331648L, available=249765777408L, percent=7.6, used=39435464704L, free=230758866944L, active=20528222208, inactive=13648789504, buffers=345387008L, cached=18661523456)
2 svmem(total=270194331648L, available=249729019904L, percent=7.6, used=39472222208L, free=230722109440L, active=20563484672, inactive=13648793600, buffers=345387008L, cached=18661523456)
3 svmem(total=270194331648L, available=249729019904L, percent=7.6, used=39472222208L, free=230722109440L, active=20563484672, inactive=13648793600, buffers=345387008L, cached=18661523456)
4 svmem(total=270194331648L, available=249729019904L, percent=7.6, used=39472222208L, free=230722109440L, active=20563484672, inactive=13648793600, buffers=345387008L, cached=18661523456)
getsizeof: 1960771816
5 svmem(total=270194331648L, available=181568315392L, percent=32.8, used=107641655296L, free=162552676352L, active=88588271616, inactive=13656334336, buffers=345395200L, cached=18670243840)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 231246823 entries, 0 to 231246822
Data columns (total 1 columns):
0    object
dtypes: object(1)
memory usage: 3.4+ GB
None
6 svmem(total=270194331648L, available=181571620864L, percent=32.8, used=107638353920L, free=162555977728L, active=88587603968, inactive=13656334336, buffers=345395200L, cached=18670247936)
7 svmem(total=270194331648L, available=181571620864L, percent=32.8, used=107638353920L, free=162555977728L, active=88587603968, inactive=13656334336, buffers=345395200L, cached=18670247936)
8 svmem(total=270194331648L, available=181571620864L, percent=32.8, used=107638353920L, free=162555977728L, active=88587603968, inactive=13656334336, buffers=345395200L, cached=18670247936)
9 svmem(total=270194331648L, available=183428308992L, percent=32.1, used=105781678080L, free=164412653568L, active=86735921152, inactive=13656334336, buffers=345395200L, cached=18670260224)
10 svmem(total=270194331648L, available=183428308992L, percent=32.1, used=105781678080L, free=164412653568L, active=86735921152, inactive=13656334336, buffers=345395200L, cached=18670260224)
11 svmem(total=270194331648L, available=183427203072L, percent=32.1, used=105782812672L, free=164411518976L, active=86736560128, inactive=13656330240, buffers=345395200L, cached=18670288896)

I even deleted the database connection and called garbage collection.

How could a 10GB table use up 60GB of my memory?

smci
davidjhp
  • What happens if you separate `DataFrame` creation and `c.fetchall()` and print memory usage in between? – Andrey Portnoy Sep 02 '18 at 00:28
  • I separated the two items as you stated, and replaced the code and output in the above question. It seems all the memory is lost immediately at fetchall(). – davidjhp Sep 02 '18 at 02:02
  • Similar: [Python hangs on fetchall using MySQL connector](https://stackoverflow.com/questions/31486547/python-hangs-on-fetchall-using-mysql-connector) – smci Sep 02 '18 at 08:09
  • Your fetchall is returning 231 million rows, seems like a bad idea. You can set the fetch size to something more sane. Also, does MySQL connector have a context handler? – smci Sep 02 '18 at 08:29
  • Can you do a `getsizeof` on an element of the `e` list of tuples? And also on each element of the tuple? – FelixEnescu Sep 02 '18 at 10:50

2 Answers


The short answer: Python data structure memory overhead.

You have a table with ~231M rows taking ~10GB on disk, so each row averages only a few dozen bytes of storage, and the id value itself is just a few characters (the examples below assume something like 'abcd').

fetchall translates that into a list of tuples like this:

[('abcd',), ('1234',), ... ]

Your list has ~231M elements; sys.getsizeof reports ~1.96GB for it, about 8.48 bytes per element, but that counts only the list's pointer slots. Every row also owns its own tuple and string objects, which is how the whole structure grows to roughly 19GB and beyond.

$ python
Python 2.7.12 (default, Nov 19 2016, 06:48:10)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys

A tuple:

>>> a = ('abcd',)
>>> sys.getsizeof(a)
64

A list of one tuple:

>>> al = [('abcd',)]
>>> sys.getsizeof(al)
80

A list of two tuples:

>>> al2 = [('abcd',), ('1234',)]
>>> sys.getsizeof(al2)
88

A list with 10 tuples:

>>> al10 = [ ('abcd',) for x in range(10)]
>>> sys.getsizeof(al10)
200

A list with 1M tuples:

>>> a_really_long = [ ('abcd',) for x in range(1000000)]
>>> sys.getsizeof(a_really_long)
8697472

Almost our number: about 8.7 bytes of list overhead per tuple, close to the 8.48 bytes per element measured above.
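
Note that getsizeof on the list counts only its array of pointers; in a real fetchall result every row also owns its own tuple object and its own string. A rough per-row accounting (a sketch for 64-bit CPython 2.7; exact sizes vary by build):

import sys

row = ('abcd',)
per_row = 8                        # one pointer slot in the list (ignoring over-allocation)
per_row += sys.getsizeof(row)      # the 1-tuple itself, 64 bytes here
per_row += sys.getsizeof(row[0])   # the str object, ~41 bytes for 4 characters on Python 2
print per_row                      # on the order of 110 bytes per row

Multiplied by ~231M rows that is already tens of gigabytes, before counting anything mysql.connector buffers internally while fetchall runs.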

Unfortunately there isn't much you can do here: mysql.connector chooses the data structure, and a dictionary cursor would use even more memory.

If you need to reduce memory usage, use fetchmany with a suitable size argument so that only one batch of rows is alive at a time.
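
A minimal sketch of that approach (untested against the question's server; the connection arguments and the 1,000,000 batch size are placeholders):

import mysql.connector
import pandas as pd

# placeholder connection arguments, standing in for the question's {snip}
db = mysql.connector.connect(host='...', user='...', password='...', database='...')
c = db.cursor()
c.execute("select id from test")

frames = []
while True:
    rows = c.fetchmany(size=1000000)   # at most ~1M tuples alive at once
    if not rows:
        break
    frames.append(pd.DataFrame(rows, columns=['id']))
    del rows                           # let each batch be freed promptly

c.close()
db.close()
d = pd.concat(frames, ignore_index=True)

The final DataFrame still stores the ids as Python string objects, so the end state is similar to the question's, but the transient 231M-tuple list never exists.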

FelixEnescu
  • Your answer states that a table of 10GB turns into 19GB in memory. That sounds like an increase of 9GB. How does that explain why 60GB disappears? – davidjhp Sep 02 '18 at 15:53
  • Right - it may be internal usage of `mysql.connector`; could you please try `fetchmany` with a page size of 100M? – FelixEnescu Sep 03 '18 at 08:43
  • I rewrote the script to use fetchmany() with a page size of 100M. The result was that memory used decreased to 23%, which means about 40GB used. This is an improvement, but it is still a mystery why a 10GB table uses 40GB of memory. – davidjhp Sep 07 '18 at 15:45

Edit: pd.read_sql officially supports SQLAlchemy connectables (plus raw sqlite3 connections). Start by using create_engine from SQLAlchemy to connect to your database:

from sqlalchemy import create_engine
engine = create_engine('mysql://database')

then call .connect() on the resulting object:

connection = engine.connect()

Pass that connection to pd.read_sql:

df = pd.read_sql("select id from test", connection)

This should decrease your memory footprint.
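
If that is not enough, pd.read_sql can also stream the result in chunks via its chunksize argument; a sketch (the connection URL and chunk size are placeholders):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mysql://user:password@host/dbname')   # placeholder URL

frames = []
for chunk in pd.read_sql("select id from test", engine, chunksize=1000000):
    frames.append(chunk)   # or aggregate each chunk here and drop it

df = pd.concat(frames, ignore_index=True)

Whether this also bounds memory on the driver side depends on the driver's cursor handling; some MySQL drivers buffer the entire result client-side unless a server-side (streaming) cursor is used.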

Would you mind posting the memory usage results after you try the above?

Andrey Portnoy
  • I rewrote the script to use pd.read_sql() as you stated, but the memory usage results were the same. By the way I got pd.read_sql() to work without using SQLAlchemy. – davidjhp Sep 02 '18 at 03:25
  • In the original script the problem starts at "e=c.fetchall()", which is before I get into Pandas. So the problem is not Pandas, it is before that. – davidjhp Sep 02 '18 at 03:28
  • @davidjhp Right. I don't have any further advice for you as I'm unfamiliar with the implementation outside of Pandas. Maybe this answer (and question) will lead you somewhere: https://stackoverflow.com/a/17861221 – Andrey Portnoy Sep 02 '18 at 03:35