1

My problem is a different variation of this GitHub discussion, also with this similar SO question, yet unanswered. Other similar questions don't really cover this specific problem.

I currently have a SQLAlchemy model mapped to a class that has about one hundred or more (!) columns. These are extracted from a series of files, but not all the files have all the columns present (there is a subset, but not all of them). Currently I defined statically all columns that might be there, but it cluttered the source and it's hard to maintain (what if a new version of the file has an additional column?)

Example of a simplified class:


Base = declarative_base()

class Variant(Base):
    __tablename__ = "variants"
    
    # Simplified common columns
    variant_id = Column(Integer, primary_key=True)
    location = Column(Text)
    start = Column(Integer)
    end = Column(Integer)

    # Example: data can have *any* of the following columns
    is_lof = Column(Boolean, nullable=True)
    gene_id = Column(String, nullable=True)
    aaf = Column(Float, default=-1)

In this simplified example, when reading from the data sources, I can have all three optional columns (is_lof, gene_id, and aaf), two, or just one of them.

Following the GH discussion above, it's easy to create something that will generate the class dynamically:

def base_attributes():
    
    data = dict()
    data["variant_id"] = Column(Integer, primary_key=True)
    data["location"] = Column(Text, nullable=False)
    data["start"] = Column(Integer, nullable=False)
    data["end"] = Column(Integer, nullable=False)

    return data

table_def = base_attributes()
table_def["clsname"] = "Variant"
table_def["__tablename__"] = "variants"

typemap = {"String": Text, "Flag": Boolean}

# Structure from file, with two fields out of three
columns_to_infer = [{"ID": "is_lof", "Type": "Flag"}, {"ID": "gene_id", "Type": "String"}]

for definition in columns_to_infer:
    type_ = typemap[definition["Type"]]
    colname = definition["ID"]
    table_def.update({colname: Column(type_)})

Variant = type(table_defs["clsname"], (Base,), table_def)

This creates the class nicely, but this can be done only at runtime when the files are read. Thus, I can't reference Variant elsewhere in the code or in other modules, for example writing functions that take the database where these tables are created to do additional queries.

My goal would be having a Variant class defined with the minimum required columns and then extend it at runtime when reading the data from the files, so that everything is then put in the database correctly. When reading from the database, autoloading if defined would take care of it.

To sum up what I'd like to have:

  • Have a Variant class with the required columns (variant_id, location, start, end)
  • Extend Variant at runtime to add the columns defined in the files (one or more among is_of, gene_id or aaf) or when reading from the database

Is such an approach even feasible?

Einar
  • 4,727
  • 7
  • 49
  • 64

0 Answers0