How to merge pandas dataframe without creating multiple duplicate rows

Question

I have two pandas dataframes that I want to merge, which I can do with the following line: ``data_cords = data_0_0.merge(data, on="unique_id", how="left")``. This gives the desired result, in that all the variables I want are present together in the ``data_cords`` df.

The problem is that my method creates many exact duplicate rows. To get my desired end product I use ``df = data_cords.drop_duplicates()``, but all of this is very expensive memory-wise, which is an issue because I run the code on Google Colab. Is there a way to do the merge without creating all the duplicate rows?
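A minimal sketch of how this kind of row blow-up can arise, assuming the right-hand frame contains repeated rows per ``unique_id``. The toy values and the ``seq`` and ``x`` columns are hypothetical and only stand in for the real data:

```python
import pandas as pd

# Hypothetical toy data; only the names data_0_0, data and unique_id
# come from the question, everything else is made up for illustration.
data_0_0 = pd.DataFrame({"unique_id": [1, 1, 2], "seq": ["a", "b", "c"]})
data = pd.DataFrame({"unique_id": [1, 1, 2, 2], "x": [10.0, 10.0, 20.0, 20.0]})

# Each left row is paired with every right row that shares its key,
# so repeated right-hand rows multiply the size of the result.
merged = data_0_0.merge(data, on="unique_id", how="left")
deduped = merged.drop_duplicates()

print(len(data_0_0), len(merged), len(deduped))
```

In this sketch the merged frame is twice as long as ``data_0_0`` before ``drop_duplicates()`` shrinks it back, which is the memory cost described above.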

I have added screenshots of each dataframe at the end of the question for clarity. Apologies if this is the incorrect format; I am relatively new here.

df data_0_0 looks like this:

and df data looks like this:

df data_cords ends up like this, with the desired columns added to the end of each sequence:

It is difficult to suggest without looking at the data (at least partially). Did you try other kinds of merge? Like ``how='inner'``? — Maxim Ivanov, Feb 16 '21 at 18:34
@MaximIvanov I have added some screenshots for clarity. To answer your question: yes, I did, but I couldn't get it to work. My method was fine when I was just using a small subsample of my data, but now it has this obvious drawback. — Sean, Feb 16 '21 at 18:54

score 0 · Answer 1 · answered Feb 16 '21 at 21:37


I actually found the answer in this earlier post here

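For my particular code, the solution is to drop the duplicate rows from ``data`` first, i.e. run ``data = data.drop_duplicates()`` (the result has to be assigned back, because ``drop_duplicates()`` returns a new dataframe by default), before running ``data_cords = data_0_0.merge(data, on="unique_id", how="left")``.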
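The approach can be sketched as follows. This is only an illustration on hypothetical toy data (the ``seq`` and ``x`` columns are assumptions), and the ``validate="many_to_one"`` argument is an optional addition, not part of the original solution; it makes pandas raise a ``MergeError`` if ``unique_id`` is still repeated on the right-hand side:

```python
import pandas as pd

# Hypothetical toy data standing in for the real frames.
data_0_0 = pd.DataFrame({"unique_id": [1, 1, 2], "seq": ["a", "b", "c"]})
data = pd.DataFrame({"unique_id": [1, 1, 2, 2], "x": [10.0, 10.0, 20.0, 20.0]})

# Remove exact duplicate rows from the right-hand frame *before* merging,
# so the merge never materialises the duplicated rows.
data_unique = data.drop_duplicates()

# validate="many_to_one" (optional) checks that each unique_id appears
# at most once in data_unique and raises MergeError otherwise.
data_cords = data_0_0.merge(
    data_unique, on="unique_id", how="left", validate="many_to_one"
)

print(len(data_0_0), len(data_cords))
```

If rows in ``data`` share a ``unique_id`` but differ in other columns, ``drop_duplicates()`` will not collapse them and the check will fail; deciding which row to keep in that case (for example with ``drop_duplicates(subset="unique_id")``) depends on the data.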


Sean
