1

Given the following data:

Class       Name
======      =============
Math        John Smith
-------------------------
Math        Jenny Simmons
-------------------------
English     Sarah Blume
-------------------------
English     John Smith
-------------------------
Chemistry   Roger Tisch
-------------------------
Chemistry   Jenny Simmons
-------------------------
Physics     Sarah Blume
-------------------------
Physics     Jenny Simmons

I have a list of classes and names in each, like so:

[
{class: 'Math', student: 'John Smith'},
{class: 'Math', student: 'Jenny Simmons'},
{class: 'English', student: 'Sarah Blume'},
{class: 'English', student: 'John Smith'},
{class: 'Chemistry', student: 'John Smith'},
{class: 'Chemistry', student: 'Jenny Simmons'},
{class: 'Physics', student: 'Sarah Blume'},
{class: 'Physics', student: 'Jenny Simmons'},
]

I'd like to create an adjacency matrix, which would, as input, have the following structure, showing the number of students in common between each pair of classes:

enter image description here

How would I be able to do so in python/pandas in the most performant manner? I've got ~19M of these class/student pairs (~240MB) in my list.

jottbe
  • 4,228
  • 1
  • 15
  • 31
scrollex
  • 2,575
  • 7
  • 24
  • 38
  • See the section on crosstab https://stackoverflow.com/questions/47152691/how-to-pivot-a-dataframe – piRSquared Sep 03 '19 at 22:10
  • 1
    @piRSquared: the linked article surely covers some part of the question, but not all of it, since he doesn't have the data in the structure which your method builds on. So I'm not sure if marking it a duplicate is justifiable. – jottbe Sep 03 '19 at 23:27
  • 2
    @jottbe ok. I reopened it – piRSquared Sep 04 '19 at 00:06

1 Answers1

1

You can prepare the data for the adjacency matrix like this:

# create the "class-tuples" by
# joining the dataframe with itself
df_cross= df.merge(df, on='student', suffixes=['_left', '_right'])
# remove the duplicate tuples
# --> this will get you a upper / or lower
# triangular matrix with diagonal = 0
# if you rather want to have a full matrix
# just change the >= to == below
del_indexer= (df_cross['class_left']>=df_cross['class_right'])
df_cross.drop(df_cross[del_indexer].index, inplace=True)
# create the counts / lists
grouby_obj= df_cross.groupby(['class_left', 'class_right'])
result= grouby_obj.count()
result.columns= ['value']
# if you want to have lists of student names
# that have the course-combination in
# common, you can do it with the following line
# otherwise just remove it (I guess with a 
# dataset of the size you mentioned, it will
# consume a lot of memory)
result['students']= grouby_obj.agg(list)

The full output looks like this:

Out[133]: 
                        value                     students
class_left class_right                                    
Chemistry  English          1                 [John Smith]
           Math             2  [John Smith, Jenny Simmons]
           Physics          1              [Jenny Simmons]
English    Math             1                 [John Smith]
           Physics          1                [Sarah Blume]
Math       Physics          1              [Jenny Simmons]

You then can use @piRSquared's method to pivot it, or do it like this:

result['value'].unstack()

Out[137]: 
class_right  English  Math  Physics
class_left                         
Chemistry        1.0   2.0      1.0
English          NaN   1.0      1.0
Math             NaN   NaN      1.0

Or, if you also want the names:

result.unstack()
Out[138]: 
              value                   students                                              
class_right English Math Physics       English                         Math          Physics
class_left                                                                                  
Chemistry       1.0  2.0     1.0  [John Smith]  [John Smith, Jenny Simmons]  [Jenny Simmons]
English         NaN  1.0     1.0           NaN                 [John Smith]    [Sarah Blume]
Math            NaN  NaN     1.0           NaN                          NaN  [Jenny Simmons]
jottbe
  • 4,228
  • 1
  • 15
  • 31