
I have written the following code to load data from a Postgres database and do some operations on it. There are about 1 million rows, and the kernel keeps dying. When I limit the data size to about 10k rows, it works.

import psycopg2
import sys, os
import numpy as np
import pandas as pd
import creds as creds
import pandas.io.sql as psql


## ****** LOAD PSQL DATABASE ***** ##
# Sets up a connection to the postgres server.
conn_string = "host="+ creds.PGHOST +" port="+ "5432" +" dbname="+ creds.PGDATABASE +" user=" + creds.PGUSER \
+" password="+ creds.PGPASSWORD
conn=psycopg2.connect(conn_string)
print("Connected!")

# Create a cursor object
cursor = conn.cursor()


sql_command = "SELECT * FROM {};".format(str("events"))
print (sql_command)

# Load the data
data = pd.read_sql(sql_command, conn)

# Taking a subset of the data until the algorithm is perfected.
# seed = np.random.seed(42)

# n = data.shape[0]
# ix = np.random.choice(n,10000)
# df_tmp = data.iloc[ix]

# Taking the source and destination and combining them into another column
# df_tmp['accounts'] = df_tmp.apply(lambda x: [x['source'], x['destination']], axis=1)
data['accounts'] = data.apply(lambda x: (x['source'], x['destination']), axis=1)
data['accounts_acc'] = data['accounts'].cumsum().apply(set)

Is there a more efficient way of doing this that doesn't keep failing?

0xsegfault
  • How many columns are you bringing back? Because 1M rows is fine. Also, where is it failing? Is it on the Postgres side? – MattR Sep 03 '19 at 19:01
  • I guess the problem is with the "apply" method because it consumes a lot of memory. Try to replace it. – J.K Sep 03 '19 at 19:03
  • It's failing on the pandas side. It doesn't have that many columns, just 6, and I can trim it down. The apply method takes two columns, combines the values into a tuple and stores it in another column – 0xsegfault Sep 03 '19 at 19:05
  • @J.K any ideas on what to change it with? – 0xsegfault Sep 03 '19 at 19:07
  • Check my answer below. – J.K Sep 03 '19 at 19:43

1 Answer


I guess the problem is with the "apply" method because it consumes a lot of memory.

Try to replace it with:

data['accounts'] = [(t.source, t.destination) for t in data.itertuples()]
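As a self-contained illustration (the column names source and destination come from the question; the toy frame itself is made up), the replacement looks like this:

import pandas as pd

# Toy frame standing in for the result of pd.read_sql in the question.
data = pd.DataFrame({
    "source":      ["a1", "a2", "a3"],
    "destination": ["b1", "b2", "b3"],
})

# itertuples yields one namedtuple per row, so the tuple can be built
# without the per-row Series construction that apply(axis=1) incurs.
data["accounts"] = [(t.source, t.destination) for t in data.itertuples()]
print(data["accounts"].tolist())  # [('a1', 'b1'), ('a2', 'b2'), ('a3', 'b3')]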

Let's test a DataFrame with 600,000 rows and 4 columns.
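For reproducibility, a minimal setup sketch, assuming four random columns named col1 to col4 (to match the snippets below) and the memory_profiler extension for %memit:

import numpy as np
import pandas as pd

# Assumed 600,000-row, 4-column test frame for the benchmarks below.
n = 600_000
df = pd.DataFrame({
    "col1": np.random.randint(0, 1000, n),
    "col2": np.random.randint(0, 1000, n),
    "col3": np.random.rand(n),
    "col4": np.random.rand(n),
})

# %memit is provided by the memory_profiler package; load it in IPython/Jupyter first.
%load_ext memory_profiler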

Memory performance:

%memit df['accounts1'] = df.apply(lambda x: (x['col1'], x['col2']), axis=1)

peak memory: 506.66 MiB, increment: 114.62 MiB

%memit run_loop()

peak memory: 475.82 MiB, increment: 82.15 MiB

%memit df['accounts2'] = [(t.col1, t.col2) for t in df.itertuples()]

peak memory: 430.07 MiB, increment: 38.02 MiB

Here run_loop, used in the second measurement above, iterates with iterrows:

def run_loop():
    new_col = []
    for i, row in df.iterrows():
        result = str(row.col1) + "," + str(row.col2)
        new_col.append(result)

Time performance:

%timeit df['accounts1'] = df.apply(lambda x: (x['col1'], x['col2']), axis=1)

9.93 s ± 345 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df['accounts2'] = [(t.col1, t.col2) for t in df.itertuples()]

598 ms ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
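A related option worth trying, not benchmarked above, is zipping the two columns directly; this avoids creating a namedtuple per row and is usually at least as fast as the itertuples comprehension (a sketch under the same assumed df):

# Builds the same list of (col1, col2) tuples straight from the two columns.
df['accounts3'] = list(zip(df.col1, df.col2))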

J.K
  • I think apply grows `O(n^2)` and bigger sets blow it up on this laptop. Thanks for the answer and the in-depth analysis – 0xsegfault Sep 03 '19 at 20:17