My problem is a different variation of this GitHub discussion, also with this similar SO question, yet unanswered. Other similar questions don't really cover this specific problem.
I currently have a SQLAlchemy model mapped to a class that has about one hundred or more (!) columns. These are extracted from a series of files, but not all the files have all the columns present (there is a subset, but not all of them). Currently I defined statically all columns that might be there, but it cluttered the source and it's hard to maintain (what if a new version of the file has an additional column?)
Example of a simplified class:
Base = declarative_base()
class Variant(Base):
__tablename__ = "variants"
# Simplified common columns
variant_id = Column(Integer, primary_key=True)
location = Column(Text)
start = Column(Integer)
end = Column(Integer)
# Example: data can have *any* of the following columns
is_lof = Column(Boolean, nullable=True)
gene_id = Column(String, nullable=True)
aaf = Column(Float, default=-1)
In this simplified example, when reading from the data sources, I can have all three optional columns (is_lof
, gene_id
, and aaf
), two, or just one of them.
Following the GH discussion above, it's easy to create something that will generate the class dynamically:
def base_attributes():
data = dict()
data["variant_id"] = Column(Integer, primary_key=True)
data["location"] = Column(Text, nullable=False)
data["start"] = Column(Integer, nullable=False)
data["end"] = Column(Integer, nullable=False)
return data
table_def = base_attributes()
table_def["clsname"] = "Variant"
table_def["__tablename__"] = "variants"
typemap = {"String": Text, "Flag": Boolean}
# Structure from file, with two fields out of three
columns_to_infer = [{"ID": "is_lof", "Type": "Flag"}, {"ID": "gene_id", "Type": "String"}]
for definition in columns_to_infer:
type_ = typemap[definition["Type"]]
colname = definition["ID"]
table_def.update({colname: Column(type_)})
Variant = type(table_defs["clsname"], (Base,), table_def)
This creates the class nicely, but this can be done only at runtime when the files are read. Thus, I can't reference Variant
elsewhere in the code or in other modules, for example writing functions that take the database where these tables are created to do additional queries.
My goal would be having a Variant
class defined with the minimum required columns and then extend it at runtime when reading the data from the files, so that everything is then put in the database correctly. When reading from the database, autoloading if defined would take care of it.
To sum up what I'd like to have:
- Have a
Variant
class with the required columns (variant_id
,location
,start
,end
) - Extend
Variant
at runtime to add the columns defined in the files (one or more amongis_of
,gene_id
oraaf
) or when reading from the database
Is such an approach even feasible?