Duplicate columns when querying SQLAlchemy into Pandas DF?

Question

I'm building a python data library for analysis on top of a star schema database and am having trouble integrating pandas and sqlalchemy because of some duplicate column keys in the data frame.

Here's the classes:

class Student(Base):
    __tablename__ = 'DimStudent'

    id = Column('StudentKey', Integer, primary_key=True)
    srcstudentid = ('SrcStudentId', Integer)
    firstname = Column('FirstName', String)
    middlename = Column('MiddleName', String)
    lastname = Column('LastName', String)
    lep = Column('LimitedEnglishProficiency', String)
    frl = Column('FreeReducedLunch', String)
    sped = Column('SpecialEducation', String)

class School(Base):
    __tablename__ = 'DimSchool'

    id = Column('SchoolKey', Integer, primary_key=True)
    name = Column('SchoolName', String)
    district = Column('SchoolDistrict', String)
    statecode = Column('StateCode', String)

class StudentScore(Base):
    __tablename__ = 'FactStudentScore'

    studentkey = Column('StudentKey', Integer, ForeignKey('DimStudent.StudentKey'), primary_key=True)
    teacherkey = Column('TeacherKey', Integer, ForeignKey('DimTeacher.TeacherKey'), primary_key=True)    
    schoolkey = Column('SchoolKey', Integer, ForeignKey('DimSchool.SchoolKey'), primary_key = True)
    assessmentkey = Column('AssessmentKey', Integer, ForeignKey('DimAssessment.AssessmentKey'), primary_key=True)
    subjectkey = Column('SubjectKey', Integer, ForeignKey('DimSubject.SubjectKey'), primary_key=True)
    yearcyclekey = Column('YearCycleKey', Integer, ForeignKey('DimYearCycle.YearCycleKey'), primary_key=True)
    pointspossible = Column('PointsPossible', Integer)
    pointsreceived = Column('PointsReceived', Integer)

    student = relationship("Student", backref=backref('studentscore'))
    school = relationship("School", backref=backref('studentscore'))
    assessment = relationship("Assessment", backref='studentscore')
    teacher = relationship("Teacher", backref='studentscore')
    subject = relationship("Subject", backref='studentscore')
    yearcycle = relationship("YearCycle", backref='studentscore')

Whenever I query my data, I consistently come up with duplicate columns of data, for example, the school key in this ORM call and then build a dataframe from it.

school = session.query(StudentScore, School, Subject)\    
.join(StudentScore.school).join(StudentScore.subject)\
.filter(School.name.like('%Dever%'))\
.filter(Subject.code == 'Math')

 a = pd.read_sql(school.statement, school.session.bind)

This SO thread provides a nice transpose technique to remove the duplicate.

 a = a.T.drop_duplicates().T

However, I'm still running into an error when I interact with this dataframe within the IDE variable explorer. The error is: "Reindexing only valid with uniquely valued Index objects"

Any idea where the issue is?

as a temporary hack can you use a.reset_index(inplace=True)? — Gecko, Jun 11 '15 at 19:46
I haven't done sqlalchemy in a while, but do you actually need the .join(School)? — Gecko, Jun 11 '15 at 19:50
A dirty solution might be to just do a.drop_duplicates(inplace=true) http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.drop_duplicates.html — PlagTag, Jun 11 '15 at 20:13
the drop_duplicates was a no go, still encountering the same error. i'm not actually sure I need the .join(school); will remove that when i'm back in office with secure connection. the school join is already implicit within the classes. — AZhao, Jun 14 '15 at 23:28
the .join is a requirement. not sure what happens without them, as it doesn't throw an exception but the memory usage explodes and crashes my computer :( — AZhao, Jun 15 '15 at 15:12

score 1 · Accepted Answer · answered Jun 15 '15 at 15:29

1

Found the correct answer! Instead of the most simple:

 a = a.T.drop_duplicates().T

I instead used a groupby to remove the duplicates:

df.T.groupby(level=0).first().T

That said, I'm not sure the drivers of my original error were. Also the new line of code works 10-100x faster than the old one.

answered Jun 15 '15 at 15:29

AZhao

13,617
7
31
54

Have also found this: http://stackoverflow.com/questions/22115819/handling-duplicate-columns-in-pandas-dataframe-constructor-from-sqlalchemy-join Related question, but seems like they use the pure query return. Would love someone forever if they could propose a similar elegant solution for duplicates coming from the return. – AZhao Jun 29 '15 at 18:27

Duplicate columns when querying SQLAlchemy into Pandas DF?

1 Answers1