Given the following data:
Class      Name
=========  =============
Math       John Smith
Math       Jenny Simmons
English    Sarah Blume
English    John Smith
Chemistry  Roger Tisch
Chemistry  Jenny Simmons
Physics    Sarah Blume
Physics    Jenny Simmons
I have a list of classes and names in each, like so:
[
    {'class': 'Math', 'student': 'John Smith'},
    {'class': 'Math', 'student': 'Jenny Simmons'},
    {'class': 'English', 'student': 'Sarah Blume'},
    {'class': 'English', 'student': 'John Smith'},
    {'class': 'Chemistry', 'student': 'John Smith'},
    {'class': 'Chemistry', 'student': 'Jenny Simmons'},
    {'class': 'Physics', 'student': 'Sarah Blume'},
    {'class': 'Physics', 'student': 'Jenny Simmons'},
]
I'd like to build an adjacency matrix from this input: one row and one column per class, with each cell holding the number of students the two classes have in common.
How can I do this in Python/pandas as efficiently as possible? My list holds ~19M of these class/student pairs (~240MB).
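For reference, here is a minimal sketch of one possible approach, not a benchmarked answer: encode classes and students as integer codes, build a sparse student-by-class incidence matrix, and multiply its transpose by itself. It assumes SciPy is available alongside pandas; the function name `class_adjacency` and the variable names are made up for illustration, and the sample rows are the ones from the list above.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def class_adjacency(pairs: pd.DataFrame) -> pd.DataFrame:
    """Count shared students for every pair of classes (hypothetical helper)."""
    # Drop repeated (class, student) rows so each enrollment counts once.
    pairs = pairs[["class", "student"]].drop_duplicates()

    # Map each label to a compact integer code (order of first appearance).
    class_codes, class_labels = pd.factorize(pairs["class"])
    student_codes, student_labels = pd.factorize(pairs["student"])

    # Sparse incidence matrix: entry (s, c) is 1 if student s takes class c.
    incidence = sparse.csr_matrix(
        (np.ones(len(pairs), dtype=np.int32), (student_codes, class_codes)),
        shape=(len(student_labels), len(class_labels)),
    )

    # (incidence.T @ incidence)[a, b] = number of students in both a and b.
    shared = (incidence.T @ incidence).toarray()
    return pd.DataFrame(shared, index=class_labels, columns=class_labels)


# Sample data copied from the list in the question.
records = [
    {"class": "Math", "student": "John Smith"},
    {"class": "Math", "student": "Jenny Simmons"},
    {"class": "English", "student": "Sarah Blume"},
    {"class": "English", "student": "John Smith"},
    {"class": "Chemistry", "student": "John Smith"},
    {"class": "Chemistry", "student": "Jenny Simmons"},
    {"class": "Physics", "student": "Sarah Blume"},
    {"class": "Physics", "student": "Jenny Simmons"},
]
adj = class_adjacency(pd.DataFrame(records))
print(adj)
```

In this sketch the diagonal holds each class's own enrollment count; `np.fill_diagonal(shared, 0)` before building the DataFrame would zero it if a strict adjacency matrix is wanted. The final `.toarray()` call produces a dense classes-by-classes array, so with a very large number of distinct classes it may be preferable to keep the product sparse.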