I tried to use Andrea's answer investigate some huge CSV files and was getting many-to-many for just about everything - even columns I was sure were 1-1. The problem was duplicates.
Here's a slightly modified version with a demo, and with a format that matches database terminology (and a description to remove ambiguity)
Firstly a clearer example
Doctors make many prescriptions which can each have several drugs prescribed, but each drug is made by one producer and each producer only makes one drug.
doctor prescription drug producer
0 Doctor Who 1 aspirin Bayer
1 Dr Welby 2 aspirin Bayer
2 Dr Oz 3 aspirin Bayer
3 Doctor Who 4 paracetamol Tylenol
4 Dr Welby 5 paracetamol Tylenol
5 Dr Oz 6 antibiotics Merck
6 Doctor Who 7 aspirin Bayer
Correct results from my functions below
Main changes to Andrea's:
- drop_duplicates on the pairs so that 1-1 does not get seen as many-many
- I put the results in a dataframe (see
report_df
in the function) to make it easier to read the results
- I reversed the logic to match UML terms (I'm staying out of the debate of set vs UML - this is just the way I wanted it)
column 1 column 2 cardinality description
0 doctor prescription 1-to-many each doctor has many prescriptions (some had 3)
1 doctor drug many-to-many doctors had up to 2 drugs, and drugs up to 3 d...
2 doctor producer many-to-many doctors had up to 2 producers, and producers u...
3 prescription doctor many-to-1 many prescriptions (max 3) to 1 doctor
4 prescription drug many-to-1 many prescriptions (max 4) to 1 drug
5 prescription producer many-to-1 many prescriptions (max 4) to 1 producer
6 drug doctor many-to-many drugs had up to 3 doctors, and doctors up to 2...
7 drug prescription 1-to-many each drug has many prescriptions (some had 4)
8 drug producer 1-to-1 1 drug has 1 producer and vice versa
9 producer doctor many-to-many producers had up to 3 doctors, and doctors up ...
10 producer prescription 1-to-many each producer has many prescriptions (some ha...
11 producer drug 1-to-1 1 producer has 1 drug and vice versa
WRONG results without drop-duplicate below
These are based on my modified copy of Andrea's aglo without the drop-duplicates.
You can see how the last row - doctor-to-drug - is many-to-many when it should be 1-1 - that explains my initial results (which are hard to debug with 1000s of records)
column 1 column 2 cardinality description
0 doctor prescription 1-to-many each doctor has many prescriptions (some had 3)
1 doctor drug many-to-many doctors had up to 3 drugs, and drugs up to 4 d...
2 doctor producer many-to-many doctors had up to 3 producers, and producers u...
3 prescription doctor many-to-1 many prescriptions (max 3) to 1 doctor
4 prescription drug many-to-1 many prescriptions (max 4) to 1 drug
5 prescription producer many-to-1 many prescriptions (max 4) to 1 producer
6 drug doctor many-to-many drugs had up to 4 doctors, and doctors up to 3...
7 drug prescription 1-to-many each drug has many prescriptions (some had 4)
8 drug producer many-to-many drugs had up to 4 producers, and producers up ...
9 producer doctor many-to-many producers had up to 4 doctors, and doctors up ...
10 producer prescription 1-to-many each producer has many prescriptions (some ha...
11 producer drug many-to-many producers had up to 4 drugs, and drugs up to 4...
New functions
from itertools import product
import pandas as pd
def get_relation(df, col1, col2):
# pair columns, drop duplicates (for proper 1-1), group by each column with
# the count of entries from the other column associated with each group
first_max = df[[col1, col2]].drop_duplicates().groupby(col1).count().max()[0]
second_max = df[[col1, col2]].drop_duplicates().groupby(col2).count().max()[0]
if first_max==1:
if second_max==1:
return '1-to-1', f'1 {col1} has 1 {col2} and vice versa'
else:
return 'many-to-1',f'many {col1}s (max {second_max}) to 1 {col2}'
else:
if second_max==1:
return '1-to-many', f'each {col1} has many {col2}s (some had {first_max})'
else:
return f'many-to-many', f'{col1}s had up to {first_max} {col2}s, and {col2}s up to {second_max} {col1}s'
def report_relations(df):
report = []
for col_i, col_j in product(df.columns, df.columns):
if col_i == col_j:
continue
relation = get_relation(df, col_i, col_j)
report.append([col_i, col_j, *relation])
report_df = pd.DataFrame(report, columns=["column 1", "column 2", "cardinality", "description"])
# formating
pd.set_option('display.max_columns', 1000, 'display.width', 1000, 'display.max_rows',1000)
# comment one of these two out depending on where you're using it
display(report_df) # for jupyter
print(report_df) # SO
test_df = pd.DataFrame({
'doctor': ['Doctor Who', 'Dr Welby', 'Dr Oz','Doctor Who', 'Dr Welby', 'Dr Oz', 'Doctor Who'],
'prescription': [1, 2, 3, 4, 5, 6, 7],
'drug': [ 'aspirin', 'aspirin', 'aspirin', 'paracetemol', 'paracetemol', 'antibiotics', 'aspirin'],
'producer': [ 'Bayer', 'Bayer', 'Bayer', 'Tylenol', 'Tylenol', 'Merck', 'Bayer']
})
display(test_df)
print(test_df)
report_relations(test_df)
Thank's Andrea - this helped me a lot.