Duplicates, and strange values after merging dataframes

Question

I am using Python 3.

I have a master dataframe " df " with the columns as shown (with 3 rows of sample data):

UNITID  CIPCODE AWLEVEL CTOTALT
100654  1.0999  5        9
100654  1.1001  5        10
100654  1.1001  7        6

I have a dataframe called " uni_names " as shown (with 3 rows of sample data):

UNITID    institution_name
100654    Alabama A & M University
100663    University of Alabama at Birmingham
100690    Amridge University

I have a dataframe called " cipcodes " as shown (with 3 rows of sample data):

cipcode_value   program_name
01.0000         Agriculture, General
01.0101         Agricultural Business and Management, General
01.0102         Agribusiness/Agricultural Business Operations

I have a dataframe called " awlevel " as shown (with 3 rows of sample data):

code    type    
3       Associate's degree
5       Bachelor's degree
7       Master's degree

What I want is an output dataframe with column names as such

institution_name    program_name    type    CTOTALT

My code below is giving duplicates and weird additional values:

import pandas as pd

# Read the master dataframe from a CSV file

df = pd.read_csv('master_data.csv')

# Read the uni_names dataframe from a CSV file

uni_names = pd.read_csv('uni_names.csv')

# Read the cipcodes dataframe from a CSV file

cipcodes = pd.read_csv('cipcodes.csv')

# Read the awlevel dataframe from a CSV file

awlevel = pd.read_csv('awlevel.csv')

# Merge df with uni_names based on UNITID

merged_df = df.merge(uni_names, on='UNITID')

# Merge merged_df with cipcodes based on CIPCODE

merged_df = merged_df.merge(cipcodes, left_on='CIPCODE', right_on='cipcode_value')

# Merge merged_df with awlevel based on AWLEVEL

merged_df = merged_df.merge(awlevel, left_on='AWLEVEL', right_on='code')

# Select the desired columns and assign new column names

output_df = merged_df[['institution_name', 'program_name', 'type', 'CTOTALT']]

output_df.columns = ['institution_name', 'program_name', 'type', 'CTOTALT']

# Print the output dataframe

print(output_df)

score 1 · Answer 1 · answered May 18 '23 at 12:45

Seems like you have 1-1 relationships. In this case, use map

uni_mapper = uni_names.set_index('UNITID')['institution_name']

master_df['institution_name'] = master_df['UNITID'].map(uni_mapper)

Same thing for the other columns, e.g.

master_df['program_name'] = master_df['CIPCODE'].map(cipcode_mapper)

score 0 · Answer 2 · answered May 18 '23 at 18:20

0

actually there were two colleges by the same name! So a bit of data exploration revealed that there were no duplicates

answered May 18 '23 at 18:20

analyst92

243
1
6

Duplicates, and strange values after merging dataframes

2 Answers2