10

I am trying to load an SQLAlchemy in a pandas DataFrame.

When I do:

df = pd.DataFrame(LPRRank.query.all())

I get

>>> df
0        <M. Misty || 1 || 18>
1        <P. Patch || 2 || 18>
...
...

But, what I want is each column in the database to be a column in the dataframe:

0        M. Misty  1  18
1        P. Patch  2  18
...
...

and when I try:

dff = pd.read_sql_query(LPRRank.query.all(), db.session())

I get an Attribute Error:

AttributeError: 'SignallingSession' object has no attribute 'cursor'

and

dff = pd.read_sql_query(LPRRank.query.all(), db.session)

also gives an error:

AttributeError: 'scoped_session' object has no attribute 'cursor'

What I'm using to generate the list of objects is:

app = Flask(__name__)
db = SQLAlchemy(app)

class LPRRank(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    candid = db.Column(db.String(40), index=True, unique=False)
    rank = db.Column(db.Integer, index=True, unique=False) 
    user_id = db.Column(db.Integer, db.ForeignKey('lprvote.id'))

    def __repr__(self):
        return '<{} || {} || {}>'.format(self.candid,
                                                 self.rank, self.user_id) 

This question: How to convert SQL Query result to PANDAS Data Structure? is error free, but gives each row as an object, which is not what I want. I can access the individual columns in the returned object, but its seems like there is a better way to do it.

The documentation at pandas.pydata.org is great if you already understand what is going on and just need to review syntax. The documentation from April 20, 2016 (the 1319 page pdf) identifies a pandas connection as still experimental on p.872.

Now, SQLALCHEMY/PANDAS - SQLAlchemy reading column as CLOB for Pandas to_sql is about specifying the SQL type. Mine is SQLAlchemy which is the default.

And, sqlalchemy pandas to_sql OperationalError, Writing to MySQL database with pandas using SQLAlchemy, to_sql, and SQLAlchemy/pandas to_sql for SQLServer -- CREATE TABLE in master db are about writing to the SQL database which produces an operational error, a database error, and a 'create table' error neither of which are my problems.

This one, SQLAlchemy Pandas read_sql from jsonb wants a jsonb attribute to columns: not my cup 'o tea.

This previous question SQLAlchemy ORM conversion to pandas DataFrame addresses my issue but the solution: using query.session.bind is not my solution. I'm opening /closing sessions with db.session.add(), and db.session.commit(), but when I use db.session.bind as specified in the second answer here, then I get an Attribute Error:

AttributeError: 'list' object has no attribute '_execute_on_connection'
David Collins
  • 848
  • 3
  • 10
  • 28

3 Answers3

23

Simply add an __init__ method in your model and call the Class object before dataframe build. Specifically below creates an iterable of tuples binded into columns with pandas.DataFrame().

class LPRRank(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    candid = db.Column(db.String(40), index=True, unique=False)
    rank = db.Column(db.Integer, index=True, unique=False) 
    user_id = db.Column(db.Integer, db.ForeignKey('lprvote.id'))

    def __init__(self, candid=None, rank=None, user_id=None):
        self.data = (candid, rank, user_id)

    def __repr__(self):
        return (self.candid, self.rank, self.user_id) 

data = db.session.query(LPRRank).all()
df = pd.DataFrame([(d.candid, d.rank, d.user_id) for d in data], 
                  columns=['candid', 'rank', 'user_id'])

Alternatively, use the SQLAlchemy ORM based on your defined Model class, LPRRank, to run read_sql:

df = pd.read_sql(sql = db.session.query(LPRRank)\
                         .with_entities(LPRRank.candid,
                                        LPRRank.rank,
                                        LPRRank.user_id).statement, 
                 con = db.session.bind)
Parfait
  • 104,375
  • 17
  • 94
  • 125
  • thanks for pointing out the formatting. I took that directly from Miguel Grinberg's Flask tutorial, which is very good. His formatting mirrors the SQLAlchemy documentation, and that is consistent with the error I get with `df = pd.DataFrame(LPRRank.query.all(), columns=['candid', 'rank', 'user_id'])`: – David Collins Mar 11 '18 at 05:43
  • ... the error is `ValueError: Shape of passed values is (1, 113), indices imply (3, 113)`. It's looking for a string, but getting a tuple. Regarding the second solution, I get an: `sqlalchemy.exc.ObjectNotExecutableError: Not an executable object: ` – David Collins Mar 11 '18 at 05:54
  • See update using `__init__` in above call and add `.statement` to *sql* arg of below call. Both tested with no errors. – Parfait Mar 11 '18 at 22:30
  • They both work. Thanks! I particularly like being able to do it without looping, even in a list comprehension. However the `read_sql` includes the SQLAlchemy ID thereby creating a redundancy with the indexing of the DataFrame. I can eliminate the redundancy by manipulating the DataFrame, but can it be withheld in the `read_sql` command? – David Collins Mar 12 '18 at 22:02
  • Awesome! And indeed you can select specific columns using [`with_entities`](https://stackoverflow.com/a/12054872/1422451). See update. Or use `index_col='id'` arg in [`read_sql`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql.html) to select primary key field as index. – Parfait Mar 13 '18 at 01:14
1

The Parfait answer is good but could have to problems:

  1. efficiency each object creation imply duplication of data into a DataFrame, so a list of dataframe could take time to be created
  2. That do not mirror a dataframe with a collection of row

Thus below example provides a parent class which is assimilated to a DataFrame representation and a child class assimilated to row of a given dataframe.

Code below provides two way to get a dataframe, the DataFrame object is created only at demand to not waste cpu and memory.

If dataframe is need at creation time you have only to add constructor (def __init__(self, rows:List[MyDataFrameRow] = None)...) and create a new attribute and assing the result of self.data_frame.

from pandas import DataFrame, read_sql
from sqlalchemy import Column, Integer, String, Float, ForeignKey
from sqlalchemy.orm import relationship, Session

Base = declarative_base()

class MyDataFrame(Base):
    __tablename__ = 'my_data_frame'
    id = Column(Integer, primary_key=True)
    rows = relationship('MyDataFrameRow', cascade='all,delete')

    @property
    def data_frame(self) -> DataFrame:
        columns = GenomeCoverageRow.data_frame_columns()
        return DataFrame([[getattr(row, column) for column in columns] for row in self.rows],
                         columns=columns)

    @staticmethod
    def to_data_frame(identifier: int, session: Session) -> DataFrame:
        query = session.query(MyDataFrameRow).join(MyDataFrame).filter(MyDataFrame.id == identifier)
        return read_sql(query.statement, session.get_bind())


class MyDataFrameRow(Base):

    __tablename__ = 'my_data_row'
    id = Column(Integer, primary_key=True)
    name= Column(String)
    age= Column(Integer)
    number_of_children = Column(Integer)
    height= Column(Integer)
    parent_id = Column(Integer, ForeignKey('my_data_frame.id'))

    @staticmethod
    def data_frame_columns() -> Tuple[Any]:
        return tuple(column.name for column in GenomeCoverageRow.__table__.columns if len(column.foreign_keys) == 0
                     and column.primary_key is False)
...
session = Session(...)
df1 = MyDataFrame.to_data_frame(1,session)
my_table_obj = session.query(MyDataFrame).filter(MyDataFrame.id == 1).one()
df2 = my_table_obj.data_frame
bioinfornatics
  • 1,749
  • 3
  • 17
  • 36
0

I'm using flask-sqlalchemy with reflection to build my models but this worked for me:

import pandas as pd
from app.models import Runs
from app import db

def get_db_data_df():
    df_runs = pd.read_sql(Runs.__table__.name, con=db.get_engine(), index_col=None)
    return df_runs
Louise
  • 1