I am trying to read feature vectors of 2048 dimensions (1 million records) from Cassandra into a pandas DataFrame, and it crashes every time.
I have 32 GB of RAM, but I am still not able to read all the data into memory; my Python program crashes every time I try to load it. I need all the data in memory at once for my machine learning algorithm. (The same data is 18 GB as CSV.)
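For reference, a rough back-of-the-envelope estimate of the raw in-memory footprint (assuming the values are stored as float64):

n_rows, n_dims, bytes_per_float = 1_000_000, 2048, 8  # float64
raw_gb = n_rows * n_dims * bytes_per_float / 1024**3
print(f"{raw_gb:.1f} GB")  # ~15.3 GB before any pandas or driver overhead

So even the bare numeric array takes roughly half of my RAM, before pandas object overhead and the driver's row buffers are counted.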
import pandas as pd
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
from cassandra.query import dict_factory
auth_provider = PlainTextAuthProvider(username=CASSANDRA_USER, password=CASSANDRA_PASS)
cluster = Cluster(contact_points=[CASSANDRA_HOST], port=CASSANDRA_PORT,
auth_provider=auth_provider)
session = cluster.connect(CASSANDRA_DB)
session.row_factory = dict_factory
query = "SELECT * FROM Table"
df = pd.DataFrame()
for row in session.execute(query):
df = df.append(pd.DataFrame())
Is this the right approach to read data into a pandas DataFrame? Is there any more memory-efficient way to load all of it?
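One alternative I have been sketching (not tested at full scale): rely on the driver's built-in result paging and pre-allocate a float32 NumPy array instead of growing a DataFrame row by row. This reuses session from the snippet above; the column name "feature" and the fetch size are placeholders for my actual schema:

import numpy as np

session.default_fetch_size = 10_000  # rows per page; value is a guess

n_rows, n_dims = 1_000_000, 2048
# float32 halves the footprint: ~8 GB instead of ~15 GB for float64
features = np.empty((n_rows, n_dims), dtype=np.float32)

for i, row in enumerate(session.execute("SELECT * FROM Table")):
    features[i] = row["feature"]  # "feature" is a placeholder column name

Filling a pre-allocated array avoids the repeated copying that appending to a DataFrame in a loop causes, but the full array still has to fit in memory.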
Options I am considering as a last resort: 1) reduce the feature vector dimension (see the sketch below), 2) add more RAM.
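For option 1, something like scikit-learn's IncrementalPCA could fit a projection in batches, so the full 2048-dimensional matrix never has to sit in memory at once. n_components and the batch generator here are hypothetical:

from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=256)  # 256 is a guess, to be tuned

# iter_feature_batches() is a hypothetical generator yielding
# (batch_size, 2048) float32 arrays read from Cassandra page by page
for batch in iter_feature_batches():
    ipca.partial_fit(batch)

# afterwards, transform each batch to 256 dimensions and stack the
# much smaller results into one array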
I cannot store the data in CSV or any other file format, as I have other operations to perform on the data in Cassandra.
The program crashes every time with just the message Killed, which I believe means it is being killed for running out of memory.