0

Primitive programmer here. I have been tasked with cleaning medical data which is stored in csv format.

(please keep in mind while you read this that I am just a beginner programmer so your patience is appreciated)

I have a file, we'll call it data1, which looks like this: data1. It has ~17,000 rows/patients

inc_key refers to a unique patient ID.

I have another file, which we'll call data2, which is identical in format except with different information stored in it, however it contains MILLIONS of rows/patients.

My goal is, for each row/patient in data1, I need to find the matching patient (inc_key value) in data2, and then append (add columns to the end of that patient) the corresponding information to the same patient in data1.

In other words, I need to merge these two files, except the inc_key values need to match.

I am using the pandas module, can anyone help me with this?

Thank you in advance to anyone who helps, it is sincerely appreciated since I am only a beginner programmer.

1 Answers1

0

You are looking for merge,

Docs here : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

You can merge the data frames like this,

data1.merge(data2, on=['inc_key'], how='left')

If you are okay with data loss if the inc_key can't be found in data2, go with inner join.

You can also select just the columns you need from data2 and join like this,

data1.merge(data2[list_of_columns + ['inc_key']], on=['inc_key'], how='left')
Sreeram TP
  • 11,346
  • 7
  • 54
  • 108