
I have written the following code to load data from a Postgres database and do some operations on it. There are about 1 million rows, and the kernel keeps dying. When I limit the data size to about 10k rows, it works.

import psycopg2
import sys, os
import numpy as np
import pandas as pd
import creds as creds
import pandas.io.sql as psql


## ****** LOAD PSQL DATABASE ***** ##
# Sets up a connection to the postgres server.
conn_string = "host="+ creds.PGHOST +" port="+ "5432" +" dbname="+ creds.PGDATABASE +" user=" + creds.PGUSER \
+" password="+ creds.PGPASSWORD
conn=psycopg2.connect(conn_string)
print("Connected!")

# Create a cursor object
cursor = conn.cursor()


sql_command = "SELECT * FROM {};".format(str("events"))
print (sql_command)

# Load the data
data = pd.read_sql(sql_command, conn)

# Taking a subset of the data until the algorithm is perfected.
# seed = np.random.seed(42)

# n = data.shape[0]
# ix = np.random.choice(n,10000)
# df_tmp = data.iloc[ix]

# Taking the source and destination and combining them into another column
# df_tmp['accounts'] = df_tmp.apply(lambda x: [x['source'], x['destination']], axis=1)
data['accounts'] = data.apply(lambda x: (x['source'], x['destination']), axis=1)
data['accounts_acc'] = data['accounts'].cumsum().apply(set)

Is there a more efficient way of doing this that doesn't keep failing?

0xsegfault
  • How many columns are you bringing back? Because 1M rows is fine. Also, where is it failing? Is it on the Postgres side? – MattR Sep 03 '19 at 19:01
  • I guess the problem is with the "apply" method because it consumes a lot of memory. Try to replace it. – J.K Sep 03 '19 at 19:03
  • It's failing on the pandas side. It doesn't have that many columns, just 6, and I can trim it down. The apply method takes two columns, combines the values into a tuple and stores it in another column – 0xsegfault Sep 03 '19 at 19:05
  • @J.K any ideas on what to change it with? – 0xsegfault Sep 03 '19 at 19:07
  • Check my answer below. – J.K Sep 03 '19 at 19:43

1 Answer


I guess the problem is with the "apply" method because it consumes a lot of memory.

Try to replace it with:

data['accounts'] = [(t.source, t.destination) for t in data.itertuples()]
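As a self-contained illustration (the column names source and destination come from the question; the toy frame itself is made up), the replacement looks like this:

import pandas as pd

# Toy frame standing in for the result of pd.read_sql in the question.
data = pd.DataFrame({
    "source":      ["a1", "a2", "a3"],
    "destination": ["b1", "b2", "b3"],
})

# itertuples yields one namedtuple per row, so the tuple can be built
# without the per-row Series construction that apply(axis=1) incurs.
data["accounts"] = [(t.source, t.destination) for t in data.itertuples()]
print(data["accounts"].tolist())  # [('a1', 'b1'), ('a2', 'b2'), ('a3', 'b3')]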

Let's test a DataFrame with 600,000 rows and 4 columns.
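For reproducibility, a minimal setup sketch, assuming four random columns named col1 to col4 (to match the snippets below) and the memory_profiler extension for %memit:

import numpy as np
import pandas as pd

# Assumed 600,000-row, 4-column test frame for the benchmarks below.
n = 600_000
df = pd.DataFrame({
    "col1": np.random.randint(0, 1000, n),
    "col2": np.random.randint(0, 1000, n),
    "col3": np.random.rand(n),
    "col4": np.random.rand(n),
})

# %memit is provided by the memory_profiler package; load it in IPython/Jupyter first.
%load_ext memory_profiler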

Memory performance:

%memit df['accounts1'] = df.apply(lambda x: (x['col1'], x['col2']), axis=1)

peak memory: 506.66 MiB, increment: 114.62 MiB

%memit run_loop()

peak memory: 475.82 MiB, increment: 82.15 MiB

%memit df['accounts2'] = [(t.col1, t.col2) for t in df.itertuples()]

peak memory: 430.07 MiB, increment: 38.02 MiB

Here run_loop, used in the second measurement above, iterates with iterrows:

def run_loop():
    new_col = []
    for i, row in df.iterrows():
        result = str(row.col1) + "," + str(row.col2)
        new_col.append(result)

Time performance:

%timeit df['accounts1'] = df.apply(lambda x: (x['col1'], x['col2']), axis=1)

9.93 s ± 345 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df['accounts2'] = [(t.col1, t.col2) for t in df.itertuples()]

598 ms ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
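A related option worth trying, not benchmarked above, is zipping the two columns directly; this avoids creating a namedtuple per row and is usually at least as fast as the itertuples comprehension (a sketch under the same assumed df):

# Builds the same list of (col1, col2) tuples straight from the two columns.
df['accounts3'] = list(zip(df.col1, df.col2))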

J.K
  • I think apply grows `O(n^2)` and bigger sets blow it up on this laptop. Thanks for the answer and the in-depth analysis – 0xsegfault Sep 03 '19 at 20:17