
I have been fetching data from a Postgres database using Python, and it is taking up a lot of memory, as can be seen below:

[memory usage plot]

The following function is the only one I am running, and it is using an excessive amount of memory. I am using fetchmany() to pull the data in small chunks, and I have also tried iterating over the cursor directly; however, all of these approaches still use far too much memory. Does anyone have any clue as to why this is happening? Is there anything I need to tune on the Postgres end to help mitigate this problem?

import logging

import pandas as pd
import psycopg2
from tqdm import tqdm


def checkMultipleLine(dbName):
    '''
    Checks for rows that contain data spanning multiple lines

    This is the most basic of checks. If a particular row has
    data that spans multiple lines, then that particular row
    is corrupt. To deal with these rows we must first find
    out whether there are places in the database that contain
    data that spans multiple lines.
    '''

    logger = logging.getLogger('mindLinc.checkSchema.checkMultipleLines')
    logger.info('Finding rows that span multiple lines')

    schema = findTables(dbName)

    results = []
    for t in tqdm(sorted(schema.keys())):

        conn = psycopg2.connect("dbname='%s' user='postgres' host='localhost'"%dbName)
        cur  = conn.cursor()
        cur.execute('select * from %s'%t)
        n = 0
        N = 0
        while True:
            css = cur.fetchmany(1000)
            if css == []: break
            for cs in css:
                N += 1
                if any(['\n' in c for c in cs if type(c)==str]):
                    n += 1
        cur.close()
        conn.close()

        tqdm.write('[%40s] -> [%5d][%10d][%.4e]'%(t, n, N, n/(N+1.0)))
        results.append({
            'tableName': t,
            'totalRows': N,
            'badRows'  : n,
        })


    logger.info('Finished checking for multiple lines')

    results = pd.DataFrame(results)[['tableName', 'badRows', 'totalRows']]
    print(results)
    results.to_csv('error_MultipleLine[%s].csv'%(dbName), index=False)

    return results
ssm
  • How did you manage to get more than one Python process? Does psycopg2 do that? – hurturk Mar 10 '17 at 05:12
  • I am guessing it is psycopg2 doing that. I haven't done anything special with my code to initiate new processes... – ssm Mar 10 '17 at 05:15
  • I have found an option to limit the buffer on following [answer](http://stackoverflow.com/a/28343332/1233686), have you tried this one? – hurturk Mar 10 '17 at 05:18
  • Let me check. If this works, you can submit an answer and I'll accept it. – ssm Mar 10 '17 at 05:25
  • I think you just solved my problem! The memory profile is really small now and remarkably stable. I will be happy to accept your answer if you write it up. – ssm Mar 10 '17 at 05:37
  • Cool! TIL another psycopg2 feature, I will add it shortly. – hurturk Mar 10 '17 at 05:38

1 Answer


Psycopg2 supports server-side cursors, which are meant to be used for large queries, as stated in this answer. Here is how to use one together with a client-side buffer setting:

cur = conn.cursor('cursor-name')  # naming the cursor makes it server-side
cur.itersize = 10000  # number of records to fetch per network round trip

That should reduce the memory footprint.
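
For reference, here is a minimal sketch of how the loop from the question might be adapted to a named (server-side) cursor. The connection string, cursor name, table name, and itersize value below are illustrative assumptions rather than part of the original code:

import psycopg2

# Illustrative connection; adjust dbname/user/host as needed.
conn = psycopg2.connect("dbname='mydb' user='postgres' host='localhost'")

# Passing a name creates a server-side (named) cursor: the result set stays
# on the server and rows are streamed to the client while iterating.
cur = conn.cursor('multiline_check')
cur.itersize = 10000  # rows transferred per network round trip

cur.execute('select * from some_table')

n = N = 0
for row in cur:  # iteration fetches itersize rows at a time
    N += 1
    if any('\n' in c for c in row if isinstance(c, str)):
        n += 1

cur.close()
conn.close()

With a regular client-side (unnamed) cursor, psycopg2 pulls the entire result set into client memory as soon as execute() runs, regardless of how the rows are fetched afterwards, which is why fetchmany() alone did not reduce the memory usage.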

hurturk