
I want to read all the data from a table with 10+ GB of data into a dataframe. When I try to read it with read_sql I get a memory overload error. I want to do some processing on that data and then update the table with the new data. How can I do this efficiently? My PC has 26 GB of RAM and the data is at most 11 GB, yet I still get a memory overload error.

With Dask it is taking a very long time. Below is the code.

import dateparser
import dask.dataframe as dd
import numpy as np

df = dd.read_sql_table('fbo_xml_json_raw_data', index_col='id',
                       uri='postgresql://postgres:passwordk@address:5432/database')

def make_year(data):
    # Expand a two-digit year string to four digits, e.g. '18' -> '2018'.
    if data and data.isdigit() and int(data) >= 0:
        data = '20' + data
    elif data and data.isdigit() and int(data) < 0:
        data = '19' + data
    return data

def response_date(data):
    # Expand the trailing two-digit year, then parse the result to 'YYYY-MM-DD'.
    if data and data.isdigit() and int(data[-2:]) >= 0:
        data = data[:-2] + '20' + data[-2:]
    elif data and data.isdigit() and int(data[-2:]) < 0:
        data = data[:-2] + '19' + data[-2:]
    if data and dateparser.parse(data):
        return dateparser.parse(data).date().strftime('%Y-%m-%d')

def parse_date(data):
    # Parse a free-form date string to 'YYYY-MM-DD'.
    if data and dateparser.parse(data):
        return dateparser.parse(data).date().strftime('%Y-%m-%d')

df.ARCHDATE = df.ARCHDATE.apply(parse_date)
df.YEAR = df.YEAR.apply(make_year)
df.DATE = df.DATE + df.YEAR
df.DATE = df.DATE.apply(parse_date)
df.RESPDATE = df.RESPDATE.apply(response_date)
Naresh
  • Does a simple operation like `len(df)`, before the processing, work? Note that you are asking Dask to guess at partition sizes here, based on the byte size of the first 5 rows - you may want to be more specific with the other kwargs. – mdurant Nov 05 '18 at 17:19
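
For reference, a minimal sketch of being more explicit about the partitioning, as the comment suggests (the npartitions value is arbitrary, the connection string is the placeholder from the question, and the uri/npartitions keywords assume the Dask version used in the question):

import dask.dataframe as dd

# Ask for a fixed number of partitions instead of letting Dask infer
# partition sizes from the byte size of the first few rows.
df = dd.read_sql_table(
    'fbo_xml_json_raw_data',
    uri='postgresql://postgres:passwordk@address:5432/database',
    index_col='id',
    npartitions=100,  # placeholder; tune so each partition fits comfortably in memory
)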

2 Answers


See here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql.html

Note the chunksize argument: it lets you read the data in chunks that fit into memory.

read_sql will then return an iterator of chunks, so you can apply your operations to each chunk in turn.

You can probably incorporate multiprocessing as well.

This adds a layer of complexity, since you're no longer working on the DataFrame itself but on an object that yields chunks.

Since you're using Dask this "should" apply; I'm not sure exactly how Dask handles chunking, as it's been a while since I touched Pandas/Dask compatibility.
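
For example, a minimal sketch of the chunked approach (the processing step, staging table and connection string are illustrative placeholders, not taken from the question):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://postgres:passwordk@address:5432/database')

# Read the table in chunks of 100k rows; each chunk is a regular DataFrame
# that easily fits in memory.
chunks = pd.read_sql('SELECT * FROM fbo_xml_json_raw_data', engine, chunksize=100_000)

for chunk in chunks:
    # Placeholder processing step.
    chunk['ARCHDATE'] = pd.to_datetime(chunk['ARCHDATE'], errors='coerce')
    # Write the processed rows to a staging table, then merge/UPDATE in SQL.
    chunk.to_sql('fbo_processed', engine, if_exists='append', index=False)

Writing each processed chunk to a staging table and doing the final UPDATE in SQL keeps the memory footprint bounded by the chunk size.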

Quentin

The main issue seems to be the exclusive use of pd.Series.apply. apply is just a row-wise Python-level loop, so it will be slow in both Pandas and Dask. For performance-critical code, you should favour column-wise (vectorised) operations.

In fact, dask.dataframe supports a useful subset of the Pandas API. Here are a couple of examples:

Avoid string operations

Convert data to numeric types first; then perform vectorisable operations. For example:

df['YEAR'] = df['YEAR'].astype(int)
df['YEAR'] = df['YEAR'].mask(df['YEAR'] >= 0, df['YEAR'] + 2000)
df['YEAR'] = df['YEAR'].mask(df['YEAR'] < 0, df['YEAR'] + 1900)

Convert to datetime

If you have datetime strings in an appropriate format:

df['ARCHDATE'] = df['ARCHDATE'].astype('M8[us]')

See also dask dataframe how to convert column to to_datetime.
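
If the strings need actual parsing rather than a plain cast, dask.dataframe also provides to_datetime, which stays column-wise. A minimal sketch, assuming DATE and YEAR are still string columns at this point:

import dask.dataframe as dd

# Vectorised parse of the combined DATE + YEAR string;
# errors='coerce' turns unparseable values into NaT instead of raising.
df['DATE'] = dd.to_datetime(df['DATE'] + df['YEAR'], errors='coerce')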

jpp