0

I have a table for experimental protocols which includes foreign keys for a number of other tables (most prominently it includes a number of Incubation entries). The structure looks like this, verbatim:

class DNAExtractionProtocol(Base):
    __tablename__ = 'dna_extraction_protocols'
    id = Column(Integer, primary_key=True)
    code = Column(String, unique=True)
    name = Column(String)
    sample_mass = Column(Float)
    mass_unit_id = Column(String, ForeignKey('measurement_units.id'))
    mass_unit = relationship("MeasurementUnit", foreign_keys=[mass_unit_id])
    digestion_buffer_id = Column(String, ForeignKey("solutions.id"))
    digestion_buffer = relationship("Solution", foreign_keys=[digestion_buffer_id])
    digestion_buffer_volume = Column(Float)
    digestion_id = Column(Integer, ForeignKey("incubations.id"))
    digestion = relationship("Incubation", foreign_keys=[digestion_id])
    lysis_buffer_id = Column(String, ForeignKey("solutions.id"))
    lysis_buffer = relationship("Solution", foreign_keys=[lysis_buffer_id])
    lysis_buffer_volume = Column(Float)
    lysis_id = Column(Integer, ForeignKey("incubations.id"))
    lysis = relationship("Incubation", foreign_keys=[lysis_id])
    proteinase_id = Column(String, ForeignKey("solutions.id"))
    proteinase = relationship("Solution", foreign_keys=[proteinase_id])
    proteinase_volume = Column(Float)
    inactivation_id = Column(Integer, ForeignKey("incubations.id"))
    inactivation = relationship("Incubation", foreign_keys=[inactivation_id])
    cooling_id = Column(Integer, ForeignKey("incubations.id"))
    cooling = relationship("Incubation", foreign_keys=[cooling_id])
    centrifugation_id = Column(Integer, ForeignKey("incubations.id"))
    centrifugation = relationship("Incubation", foreign_keys=[centrifugation_id])

    volume_unit_id = Column(String, ForeignKey('measurement_units.id'))
    volume_unit = relationship("MeasurementUnit", foreign_keys=[volume_unit_id])

Now, given the unique code attribution, I would like to get a Pandas dataframe (or rather a Series) which allows me to select not only any of the attributes of the corresponding entry in the "dna_extraction_protocols" table, but also in the related tables.

I am currently selecting a pandas dataframe with:

sql_query = session.query(DNAExtractionProtocol).join(DNAExtractionProtocol.digestion_buffer).filter(DNAExtractionProtocol.code == code)
for item in sql_query:
    pass
mystring = str(sql_query)
mydf = pd.read_sql_query(mystring,engine,params=[code])
print(mydf)

But this only allows me to select the IDs of the related keys. I can select mydf["dna_extraction_protocols_mass_unit_id"] - but I would also like to be able to select mydf["dna_extraction_protocols_mass_unit_long_name"], given the following available keys on the "measurement_units" table:

class MeasurementUnit(Base):
    __tablename__ = "measurement_units"
    id = Column(Integer, primary_key=True)
    code = Column(String, unique=True)
    long_name = Column(String)
    siunitx = Column(String)
TheChymera
  • 17,004
  • 14
  • 56
  • 86

2 Answers2

1

The classic answer to such questions is to create a SQL VIEW.

Views are like dynamic virtual tables - in queries you use the view name instead of a table name, and the DBMS runs the query defined by the view to produce the rows for the query on the view. You therefore see rows based on the data from the tables at the time you access the view, not at the time the view was created.

You would create this view with a statement such as

CREATE VIEW PROT_WITH_UNITS AS
  SELECT * FROM dna_extraction_protocols P
           JOIN measurement_units M
             ON P.volume_unit = M.id

This will give you a view with all columns of both tables, pre-joined on (what I presume to be) the required foreign key.

If you get the definition wrong you can drop views just like tables, so you should get there eventually.

holdenweb
  • 33,305
  • 7
  • 57
  • 77
  • thank you, that's already a big step forward - but the `measurement_units` table was just one example. I would like to join all tables which my entry links to - prefferably withourt having to specify them by hand. Is this possible? – TheChymera Mar 21 '16 at 17:57
  • It seems this querry works: `sql_query = session.query(tables[table]).options(eagerload(tables[table].digestion_buffer)).filter(tables[table].code == code)` but how can I get around having to secify each related table? – TheChymera Mar 21 '16 at 18:19
  • You specify all necessary tables in a huge select statement in the view. Once. If faster data access is more important than flexibility you can _materialize_ the view by retrieving it and storing its result in a large table. – holdenweb Mar 21 '16 at 22:10
0

The most pythonic way to deal with this which I could yet come across is based on the response to a related question. It seems one can use SQLAlchemy's introspect module. One caveat that still remains in this is that for every level of related tables you want to join, you would need to add a nested for loop for the lowest inspection call.

Here is an example which joins: * all related tables of the main table * all related tables of all related tables of the main table:

cols = []
joins = []
insp = inspection.inspect(DNAExtractionProtocol)
for name, col in insp.columns.items():
    cols.append(col.label(name))
for name, rel in insp.relationships.items():
    alias = aliased(rel.mapper.class_, name=name)
    joins.append((alias, rel.class_attribute))
    for col_name, col in inspection.inspect(rel.mapper).columns.items():
        #the id column causes double entries, as it is mapped once on the parent table (related_table_id) and once on the child table (table_id)
        if col.key != "id":
            aliased_col = getattr(alias, col.key)
            cols.append(aliased_col.label("{}_{}".format(name, col_name)))

    sub_insp = inspection.inspect(rel.mapper.class_)
    for sub_name, sub_rel in sub_insp.relationships.items():
        if "contains" not in sub_name:
            sub_alias = aliased(sub_rel.mapper.class_, name=name+"_"+sub_name)
            joins.append((sub_alias, sub_rel.class_attribute))
            for sub_col_name, sub_col in inspection.inspect(sub_rel.mapper).columns.items():
                print(sub_alias, sub_col.key, '###')
                #the id column causes double entries, as it is mapped once on the parent table (related_table_id) and once on the child table (table_id)
                if sub_col.key != "id":
                    sub_aliased_col = getattr(sub_alias, sub_col.key)
                    cols.append(sub_aliased_col.label("{}_{}_{}".format(name, sub_name, sub_col_name)))

sql_query = session.query(*cols).select_from(DNAExtractionProtocol)
for join in joins:
    sql_query = sql_query.outerjoin(*join)
sql_query = sql_query.filter(DNAExtractionProtocol.code == code)

I had to add a hack to exclude ID columns, as those cause duplicate column names for the preferred naming scheme - but these columns can also be kept if the naming scheme is instead edited.

Community
  • 1
  • 1
TheChymera
  • 17,004
  • 14
  • 56
  • 86