Nested queries / comparing multiple datasets efficiently in Pandas

Question

I am using Pandas (first time) to determine whether personnel meet prerequisites when it comes to course attendance. The code below returns the desired results however I am sure there are much better ways to achieve the same outcome.

The criteria to determine if you can undertake physics is as follows;

(Math_A OR Math_B OR Math_C) AND (Eng_A OR Eng_B) AND NOT (Physics)

My question is what efficiencies or alternate methods can be applied to achieve the task.

Reading up on nested queries and alike, I am not able to come up with a way to compare multiple queries in the one query. Ideally, I'd like to have one statement that checks if the person satisfies the prerequisites however I have failed at this so far.

Dataset - Will normally contain > 20,000 records

Emplid,Name,CourseId,CourseName
123,David,P12,Mathematics A
123,David,P13,Mathematics B
123,David,P14,Mathematics C
123,David,P32,Engineering A
456,Sally,P33,Engineering B
789,Roger,P99,Physics

Code

Revised to simplify readability - Thx Boud.

import pandas as pd

def physics_meets_prereqs():

    df = pd.DataFrame({'Emplid':['123','123', '123', '123', '456', '789'],
                   'Name':['David','David','David','David','Sally','Roger'],
                   'CourseId':['P12','P13','P14','P32','P33','P99'],
                   'CourseName':['Mathematics A','Mathematics B','Mathematics C','Engineering A','Engineering B', 'Physics']
                    })

    # Get datasets of individually completed courses
    has_math = df.query('CourseId == "P12" or CourseId == "P13" or CourseId == "P14"')
    has_eng = df.query('CourseId == "P32" or CourseId == "P33"')
    has_physics = df.query('CourseId == "P99"')

    # Get personnel who have completed math and engineering
    has_math_and_eng = has_math[(has_math['Emplid'].isin(has_eng['Emplid']))]

    # Remove personnel who have completed physics
    has_math_and_eng_no_physics = has_math_and_eng[~(has_math_and_eng['Emplid'].isin(has_physics['Emplid']))]

    print(has_math_and_eng_no_physics)

physics_meets_prereqs()

Output

  CourseId     CourseName Emplid   Name
0      P12  Mathematics A    123  David
1      P13  Mathematics B    123  David
2      P14  Mathematics C    123  David

The output is resulting in David being identified as meeting the prerequisites for the physics course. It does list him 3 times which I have not figured out how to limit as yet. The way I am achieving this though can definitely be improved.

In a nutshell

Show me a list of people who have completed at least one of the maths courses, at least one of the engineering courses and have not yet completed the physics course.

score 4 · Accepted Answer · answered Nov 12 '16 at 23:39

4

pivoted = df.groupby(['Name', df.CourseName.str.split().str[0]]) \
            .CourseId.size().gt(0).unstack(fill_value=False)

pivoted

matches = pivoted.query('Engineering & Mathematics & ~Physics')
matches

df.query('Name in @matches.index')

answered Nov 12 '16 at 23:39

piRSquared

285,575
57
475
624

Oh my god. You are a wizard! With some tweeking that solves a whole bunch of questions I had with one swoop. Thank you so much! – Dan Nov 12 '16 at 23:57
3

No he just likes rubikscube ;-) – Zeugma Nov 12 '16 at 23:58

Zeugma · Answer 2 · 2016-11-12T22:48:12.910

1

Use query to type more natural math relationships :

df.query('CourseId == "P12" or CourseId != "P99"')

edited Nov 12 '16 at 22:48

answered Nov 12 '16 at 22:39

Zeugma

31,231
9
69
81

if CourseId is P12 it will always not equal P99. Pandas works on rows, so how will your suggestion improve anything? – Dan Nov 12 '16 at 22:44
1

Changed and into or. Away from PC now so cannot run code. My point is what you re searching for is probably more flexible and readable if you run it through query instead of creating cascading intermediate variables – Zeugma Nov 12 '16 at 22:49
That cleared it up. I see where you are coming from with the query readability. The reason I went with the initial way is that these criteria will be used multiple times throughout the program once I add the bunch of other courses. I figured it saved having to duplicate query text further down the track. Thanks for the suggestion and I agree that it is easier to read in your instance. – Dan Nov 12 '16 at 22:58

Nested queries / comparing multiple datasets efficiently in Pandas

2 Answers2

Linked