
My data sample is really big (1.2 million documents), and I need to create and analyse the data in a single pandas DataFrame. For now my code looks like this:

import psycopg2
import pandas as pd

conn = psycopg2.connect("dbname=monty user=postgres host=localhost password=postgres")
cur = conn.cursor('aggre')  # named (server-side) cursor
cur.execute("SELECT * FROM binance.zrxeth_ob_indicators;")
row = cur.fetchall()
df = pd.DataFrame(row, columns=['timestamp', 'topAsk', 'topBid', 'CPA', 'midprice', 'CPB', 'spread', 'CPA%', 'CPB%'])

But it takes ages to load everything locally into the variable df. What I tried so far was to do this:

for row in cur:
      dfsub = pd.DataFrame(row,columns=['timestamp', 'topAsk', 'topBid', 'CPA', 'midprice', 'CPB', 'spread', 'CPA%', 'CPB%'])
      df = df.concat([df,dfsub])

but it gives me the following error: DataFrame constructor not properly called!

any idea? Thanks!

Viktor.w

3 Answers


You can do something like this:

import getpass

import pandas as pd
import psycopg2


class Postgres:
    def __init__(self, host, database, user=None, password='', schema='public'):
        self.user = user or getpass.getuser()
        self.database = database
        self.host = host
        self.engine = self.create_engine(self.host, self.database, self.user, password)
        self.schema = schema

    @staticmethod
    def create_engine(host, database, user, password):
        # despite the name, this returns a plain psycopg2 connection built from a libpq URI
        return psycopg2.connect("postgresql://{user}:{password}@{host}/{database}".format(
            host=host,
            database=database,
            user=user,
            password=password
        ))

    def execute(self, query: str) -> pd.DataFrame:
        """Run a query and return the result as a pandas DataFrame."""
        # pd.read_sql accepts the DBAPI connection directly
        result_df = pd.read_sql(query, self.engine)
        self.engine.commit()
        return result_df

With this you use pandas' optimized DataFrame creation from a Postgres result.

But given the size of your dataset, it will still take some time to read all the data into memory.
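
For example, a minimal usage sketch (reusing the Postgres class above and the connection details from your question):

db = Postgres(host='localhost', database='monty', user='postgres', password='postgres')
df = db.execute("SELECT * FROM binance.zrxeth_ob_indicators")
print(df.shape)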


Pandas has a nice builtin read_sql method that should be pretty efficient.

i.e. just do:

df = pd.read_sql("SELECT * FROM binance.zrxeth_ob_indicators", conn)

and it should just work…

on its own 1.2 million rows isn't much. Given your column count/names it's probably around 300 MB of RAM (30 bytes per value * 9 columns * 1.2e6 rows ≈ 324 MB) and should take <10 seconds on a recent computer
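
As a quick sanity check on that estimate (assuming the df loaded above), pandas can report the actual footprint:

mem_bytes = df.memory_usage(deep=True).sum()  # per-column usage in bytes, including object overhead
print(round(mem_bytes / 1e6, 1), "MB")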

Sam Mason

I think that since your document set is so large, it will take a long time to load it into memory no matter how you go about it. If you do not need to hold the entire dataset in memory at once, I would suggest using pandas' built-in chunked loading (the chunksize parameter of read_sql), which lets you sequentially load and process pieces of your data and is designed for exactly this use case.

See this question for an example: How to read a 6 GB csv file with pandas
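
A minimal sketch of that approach, assuming the psycopg2 connection from your question; the chunk size of 100,000 is an arbitrary choice:

import pandas as pd
import psycopg2

conn = psycopg2.connect("dbname=monty user=postgres host=localhost password=postgres")

# chunksize makes read_sql return an iterator of DataFrames instead of one big frame
chunks = pd.read_sql("SELECT * FROM binance.zrxeth_ob_indicators", conn, chunksize=100000)

total_rows = 0
for chunk in chunks:
    # each chunk is an ordinary DataFrame with up to 100,000 rows; process it here
    total_rows += len(chunk)

print(total_rows)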

James Fulton