Pandas - matching values across different CSVs and then appending a column to original file

Question

Beginner programmer here. I have been tasked with cleaning medical data that is stored in CSV format.

(please keep in mind while you read this that I am just a beginner programmer so your patience is appreciated)

I have a file, which we'll call data1. It has ~17,000 rows, one per patient.

inc_key refers to a unique patient ID.

I have another file, which we'll call data2. It is identical in format but stores different information, and it contains millions of rows/patients.

My goal is, for each row/patient in data1, to find the matching patient (same inc_key value) in data2 and then append the corresponding information from data2 as new columns at the end of that patient's row in data1.

In other words, I need to merge these two files on matching inc_key values.

I am using the pandas module. Can anyone help me with this?

Thank you in advance to anyone who helps, it is sincerely appreciated since I am only a beginner programmer.

you are looking for pandas merge. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html — Sreeram TP, Jun 29 '21 at 13:29
Does this answer your question? [Pandas Merging 101](https://stackoverflow.com/questions/53645882/pandas-merging-101) — It_is_Chris, Jun 29 '21 at 13:30

Sreeram TP · Accepted Answer · 2021-06-29T13:57:54.487


You are looking for merge.

Docs here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

You can merge the data frames like this:

data1.merge(data2, on=['inc_key'], how='left')

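With how='left', every row of data1 is kept, and patients whose inc_key can't be found in data2 get NaN in the new columns. If you are okay with dropping those patients instead, use an inner join (how='inner').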
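To make the difference between the two join types concrete, here is a minimal sketch on two tiny made-up frames. The columns `age` and `score` and all values are hypothetical placeholders, not taken from the question's data; in practice the frames would come from `pd.read_csv`.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files (made-up columns and values).
data1 = pd.DataFrame({"inc_key": [1, 2, 3], "age": [34, 58, 41]})
data2 = pd.DataFrame({"inc_key": [1, 3, 4, 5], "score": [9, 4, 7, 2]})

# Left join: every row of data1 is kept; inc_key 2 has no match in data2,
# so its "score" is NaN.
left = data1.merge(data2, on="inc_key", how="left")

# Inner join: only patients present in both frames are kept (inc_key 1 and 3).
inner = data1.merge(data2, on="inc_key", how="inner")

print(left)
print(inner)
```

Note that if an inc_key appears more than once in data2, the merge repeats the matching data1 row once per match. If each patient is expected to appear only once in data2, passing `validate="many_to_one"` to `merge` should make pandas raise an error when that assumption does not hold.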

You can also select just the columns you need from data2 (plus inc_key, which the merge needs) and join like this:

data1.merge(data2[list_of_columns + ['inc_key']], on=['inc_key'], how='left')
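Since data2 has millions of rows, it may also help to load only the needed columns in the first place. The sketch below assumes hypothetical column names (`score`, `los`, `notes`) and uses in-memory strings in place of the real CSV files so it can be run as-is; `usecols` is a standard `pd.read_csv` parameter.

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for the real files on disk.
csv1 = io.StringIO("inc_key,age\n1,34\n2,58\n")
csv2 = io.StringIO("inc_key,score,los,notes\n1,9,3,a\n2,4,10,b\n3,7,1,c\n")

# Hypothetical names of the columns wanted from data2.
list_of_columns = ["score", "los"]

data1 = pd.read_csv(csv1)

# usecols tells pandas to load only these columns, so "notes" is never read.
data2 = pd.read_csv(csv2, usecols=list_of_columns + ["inc_key"])

# Append the selected data2 columns to each matching patient in data1.
merged = data1.merge(data2, on="inc_key", how="left")
print(merged)
```

With real files, the `io.StringIO` objects would be replaced by file paths, and the result could be written back out with something like `merged.to_csv("data1_merged.csv", index=False)`, where the output filename is only an example.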

edited Jun 29 '21 at 13:57

answered Jun 29 '21 at 13:31


Thank you so much! Is there a way for me to merge only specified columns from data2, or must I merge all of them? – Sean Roudnitsky Jun 29 '21 at 13:39
You can select the columns you need from data2 if you want to. – Sreeram TP Jun 29 '21 at 13:56
