
I'm creating a merge from two tables. The first table looks like this:

            a       b    c
0           32    171    28
1           32    172    28
2          1014   173    28
3          1014   179    28
4          1014   154    26
5          1049   156    26

The second table looks like this:

            a      d
0           32    fdxjgset
1           32    5j4j64j4
2          1014   4564jsr5
3          1014   5jhszxse
4          1014   kuts5555
5          1049   srh5jx5x

I'm expecting to get something like this:

            a       b    c    d
0           32    171    28   fdxjgset 
1           32    172    28   5j4j64j4
2          1014   173    28   4564jsr5
3          1014   179    28   5jhszxse
4          1014   154    26   kuts5555
5          1049   156    26   srh5jx5x

But I'm getting extra copies of the rows whose 'a' values are duplicated, like this:

            a       b    c    d
0           32    171    28   fdxjgset 
1           32    172    28   5j4j64j4
2           32    171    28   fdxjgset 
3           32    172    28   5j4j64j4
4          1014   173    28   4564jsr5
5          1014   179    28   5jhszxse
6          1014   154    26   kuts5555
7          1014   173    28   4564jsr5
8          1014   179    28   5jhszxse
9          1014   154    26   kuts5555
10         1049   156    26   srh5jx5x

My code is:

data_1 = pd.read_csv("First file.csv",encoding='latin1')
data_2 = pd.read_csv("Second file.csv",encoding='latin1')

data_2_dups = data_2.drop_duplicates()  # remove duplicates
data = data_1.merge(data_2_dups, on='a', how = 'left', indicator=True)
#data1 = data.drop_duplicates()
data.to_csv("merged file.csv")

Now I did remove all duplicates, as suggested in other threads here, but that doesn't seem to work; it's still creating duplicate rows for the duplicated keys. Any idea what I'm doing wrong? Thanks.

lily

4 Answers


If the indexes are exactly the same, you can simply do this:

df3 = pd.merge(df[['a','b','c']], df2['d'], right_index=True, left_index=True)
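
With the variable names from the question, and assuming both CSV files are in the same row order (so their default integer indexes line up row for row), that index-based join might look like this sketch:

import pandas as pd

# read the two files; this assumes they are in the same row order,
# so their default RangeIndex values match row for row
data_1 = pd.read_csv("First file.csv", encoding='latin1')
data_2 = pd.read_csv("Second file.csv", encoding='latin1')

# join on the index instead of on the duplicated 'a' key
data = pd.merge(data_1[['a', 'b', 'c']], data_2['d'],
                left_index=True, right_index=True)
data.to_csv("merged file.csv")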
BoomBoxBoy

If the indexes of your two dataframes align, you don't need to use merge; instead, use the line below to get the d column into your first dataframe.

df1['d'] = df2['d'] 

If they don't align, you can either include a second key to merge on, or use groupby(...).first().

df = pd.merge(df1, df2, on=['key1', 'key2'], ...)

or, after running the code you already have,

df = df.groupby(['a','b']).first()
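
Applied to the merged frame from the question (here assumed to be called data), a sketch of that last option could be:

# keep the first matching 'd' for each (a, b) pair;
# sort=False preserves the original row order and
# as_index=False keeps 'a' and 'b' as regular columns
data = data.groupby(['a', 'b'], as_index=False, sort=False).first()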

The reason it's duplicating is that your key has duplicates. To remove duplicate keys you would have to use drop_duplicates(subset=['a']), as sketched below.
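
A minimal sketch of that approach, using the variable names from the question, drops the duplicate keys before merging:

# keep only the first row of data_2 for each value of 'a',
# so the left merge can no longer multiply rows of data_1
data_2_dups = data_2.drop_duplicates(subset=['a'])
data = data_1.merge(data_2_dups, on='a', how='left')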

rhug123

Try this:

data_1 = pd.read_csv("First file.csv", encoding='latin1')
data_2 = pd.read_csv("Second file.csv", encoding='latin1')

data = data_1.merge(data_2, on='a', how='left')
data1 = data.drop_duplicates(subset=['a','d']).reset_index(drop=True)
data1.to_csv("merged file.csv")

subset=['a','d'] keeps only one row per ('a', 'd') pair, and reset_index(drop=True) just re-indexes the resulting DataFrame.


Using your example data, dropping duplicates in data_2 won't change anything: drop_duplicates without any parameters checks all columns for duplicates, and no row of data_2 is a full duplicate of another.

data_2.drop_duplicates()
            a      d
0           32    fdxjgset
1           32    5j4j64j4
2          1014   4564jsr5
3          1014   5jhszxse
4          1014   kuts5555
5          1049   srh5jx5x

Focusing on "a" = 32, you are getting what you should: there are 2 rows with that key in data_1 and 2 rows in data_2, so the merge produces 2 × 2 = 4 rows.

data_1
            a       b    c
0           32    171    28
1           32    172    28
data_2
            a      d
0           32    fdxjgset
1           32    5j4j64j4


data_1.merge(data_2_dups, on='a', how = 'left', indicator=True)
            a       b    c    d
0           32    171    28   fdxjgset 
1           32    171    28   5j4j64j4
2           32    172    28   fdxjgset
3           32    172    28   5j4j64j4

You may decide to keep only one "a" value from data_2, depending on your end goal.

data_2 = data_2.drop_duplicates(subset='a')
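
As a sketch of what that choice does with the question's data: only the first d per a survives, so every row of data_1 that shares an a value gets that same d after the merge.

data_1.merge(data_2, on='a', how='left')
            a       b    c    d
0           32    171    28   fdxjgset
1           32    172    28   fdxjgset
2          1014   173    28   4564jsr5
3          1014   179    28   4564jsr5
4          1014   154    26   4564jsr5
5          1049   156    26   srh5jx5x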
Jason Cook