I want to read all the data from a table holding 10+ GB of data into a dataframe. When I try to read it with read_sql I get a memory overload error. I then want to do some processing on that data and update the table with the new values. How can I do this efficiently? My PC has 26 GB of RAM and the data is at most 11 GB, yet I still get a memory overload error.
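Would chunked reading avoid the memory problem? Something roughly like the sketch below is what I have in mind (the chunk size and the staging-table name are just placeholders):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://postgres:passwordk@address:5432/database')

# Read the table in chunks so only one chunk is held in memory at a time.
for chunk in pd.read_sql_query('SELECT * FROM fbo_xml_json_raw_data',
                               engine, chunksize=50000):
    # ... clean the date columns of this chunk here ...
    # Append the processed rows to a staging table, then update the
    # original table from it in SQL.
    chunk.to_sql('fbo_xml_json_raw_data_clean', engine,
                 if_exists='append', index=False)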
I also tried Dask, but it takes a very long time. Below is the code.
import dateparser
import dask.dataframe as dd
import numpy as np
df = dd.read_sql_table('fbo_xml_json_raw_data', index_col='id', uri='postgresql://postgres:passwordk@address:5432/database')
def make_year(data):
    # Expand a two-digit year string to a four-digit year (e.g. '18' -> '2018')
    if data and data.isdigit() and int(data) >= 0:
        data = '20' + data
    elif data and data.isdigit() and int(data) < 0:
        data = '19' + data
    return data

def response_date(data):
    # Expand the trailing two-digit year, then parse and normalise to YYYY-MM-DD
    if data and data.isdigit() and int(data[-2:]) >= 0:
        data = data[:-2] + '20' + data[-2:]
    elif data and data.isdigit() and int(data[-2:]) < 0:
        data = data[:-2] + '19' + data[-2:]
    if data and dateparser.parse(data):
        return dateparser.parse(data).date().strftime('%Y-%m-%d')

def parse_date(data):
    # Parse an arbitrary date string and normalise it to YYYY-MM-DD
    if data and dateparser.parse(data):
        return dateparser.parse(data).date().strftime('%Y-%m-%d')
# Update the columns with item assignment (the supported way in Dask);
# meta gives the output column name and dtype so Dask does not have to infer it.
df['ARCHDATE'] = df['ARCHDATE'].apply(parse_date, meta=('ARCHDATE', 'object'))
df['YEAR'] = df['YEAR'].apply(make_year, meta=('YEAR', 'object'))
df['DATE'] = df['DATE'] + df['YEAR']
df['DATE'] = df['DATE'].apply(parse_date, meta=('DATE', 'object'))
df['RESPDATE'] = df['RESPDATE'].apply(response_date, meta=('RESPDATE', 'object'))
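Is the per-row apply with dateparser what makes Dask so slow? I was wondering whether something like map_partitions with a vectorised pd.to_datetime would be faster. A rough sketch of what I mean (it assumes the dates are in a format pandas can parse, which dateparser does not require):

import pandas as pd

def clean_partition(pdf):
    # pdf is one partition as a plain pandas DataFrame, so the date parsing
    # is vectorised per partition instead of being called once per row.
    pdf = pdf.copy()
    pdf['ARCHDATE'] = (pd.to_datetime(pdf['ARCHDATE'], errors='coerce')
                         .dt.strftime('%Y-%m-%d'))
    return pdf

df = df.map_partitions(clean_partition)
result = df.compute()  # pulls everything into one pandas DataFrame;
                       # df.to_parquet('cleaned/') would stream it to disk instead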