I have two pandas DataFrames that I want to merge, which I can do with the following line: ``data_cords = data_0_0.merge(data, on="unique_id", how="left")``. This gives the desired result, in that all the variables I want are present together in the ``data_cords`` df.

The problem is that my method creates many exact duplicate rows. To get my desired end product I use ``df = data_cords.drop_duplicates()``, but all of this is very expensive memory-wise, which is an issue because I run the code on Google Colab. Is there a way to do the merge without creating all the duplicate rows?

I have inserted screenshots of each dataframe at the end of the question to add clarity. Apologies if this is the incorrect format; I am relatively new here.

df ``data_0_0`` looks like this: [screenshot]

df ``data`` looks like this: [screenshot]

df ``data_cords`` ends up like this, with the desired columns added to the end of each sequence: [screenshot]

Sean
  • It is difficult to suggest without looking at the data (at least partially). Did you try other kinds of merge? Like ``how='inner'``? – Maxim Ivanov Feb 16 '21 at 18:34
  • @MaximIvanov I have added some screenshots for clarity. To answer your question: yes, I did, but I couldn't get it to work. My method was fine when I was just using a small subsample of my data, but now it has this obvious drawback. – Sean Feb 16 '21 at 18:54
  • Can you replace the images with clear text data? – Joe Ferndz Feb 16 '21 at 19:10

1 Answer

I actually found the answer in an earlier post.

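For my particular code, the solution is to remove the duplicate rows from ``data`` before merging. Each duplicate row in ``data`` matches every row of ``data_0_0`` with the same ``unique_id``, which is where the extra rows come from. So run ``data = data.drop_duplicates()`` first (the result has to be assigned back, since ``drop_duplicates()`` returns a new DataFrame by default) and then run ``data_cords = data_0_0.merge(data, on="unique_id", how="left")``.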
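
Here is a minimal sketch of that fix on toy data. Only the names ``data_0_0``, ``data``, ``data_cords`` and ``unique_id`` come from the question; the other columns (``step``, ``x``, ``y``), the values, and the ``naive`` and ``data_unique`` variables are made up for illustration. The ``validate="many_to_one"`` argument is an optional extra that was not part of my original code; it makes pandas raise an error if ``unique_id`` still repeats in the right-hand frame.

```python
import pandas as pd

# Hypothetical stand-ins for the real frames; only "unique_id" is from the question.
data_0_0 = pd.DataFrame({
    "unique_id": [1, 1, 2, 3],
    "step": [0, 1, 0, 0],
})
data = pd.DataFrame({
    "unique_id": [1, 1, 2, 2, 3],
    "x": [10.0, 10.0, 20.0, 20.0, 30.0],
    "y": [5.0, 5.0, 6.0, 6.0, 7.0],
})

# Merging as-is: every left row is paired with every duplicate right row,
# so the result contains exact duplicate rows.
naive = data_0_0.merge(data, on="unique_id", how="left")

# drop_duplicates() returns a new DataFrame, so keep the result.
data_unique = data.drop_duplicates()

# Optional safety check: validate="many_to_one" raises pandas.errors.MergeError
# if unique_id is still not unique in the right-hand frame.
data_cords = data_0_0.merge(
    data_unique, on="unique_id", how="left", validate="many_to_one"
)

print(len(naive), len(data_cords))
```

On this toy data the naive merge has 7 rows and the deduplicated merge has 4, which matches what ``naive.drop_duplicates()`` would give, but without ever building the larger intermediate frame. Note that ``drop_duplicates()`` only removes rows that are identical in every column; if ``data`` had rows sharing a ``unique_id`` but differing in other columns, the merge would still multiply rows, and the ``validate`` check would fail.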

Sean