Prevent duplication of rows when performing merge

Question

I've hit a wall with a data analysis project I'm working on.

Essentially, if I have example CSV 'A':

id   | item_num
A123 |     1
A123 |     2
B456 |     1

And I have example CSV 'B':

id   | description
A123 | Mary had a...
A123 | ...little lamb.
B456 | ...Its fleece...

If I perform a merge using Pandas, it ends up like this:

id   | item_num | description
A123 |     1    | Mary had a...
A123 |     2    | Mary had a...
A123 |     1    | ...little lamb.
A123 |     2    | ...little lamb.
B456 |     1    | Its fleece...

How could I instead make it become:

id   | item_num | description
A123 |     1    | Mary had a...
A123 |     2    | ...little lamb...
B456 |     1    | Its fleece...

This is my code:

import pandas as pd

# Import CSVs
first = pd.read_csv("../PATH_TO_CSV/A.csv")
print("Imported first CSV: " + str(first.shape))
second = pd.read_csv("../PATH_TO_CSV/B.csv")
print("Imported second CSV: " + str(second.shape))


# Create a resultant, but empty, DF, and then append the merge.
result = pd.DataFrame()
result = result.append(pd.merge(first, second), ignore_index = True)
print("Merged CSVs... resulting DataFrame is: " + str(result.shape))

# Lets do a "dedupe" to deal with an issue on how Pandas handles datetime merges
# I read about an issue where if datetime is involved, duplicate entires will be created.
result = result.drop_duplicates()
print("Deduping... resulting DataFrame is: " + str(result.shape))

# Save to another CSV
result.to_csv("EXPORT.csv", index=False)
print("Saved to file.")

I would really appreciate any help - I'm very stuck! And I'm dealing with 20,000+ rows.

Thanks.

Edit: my post was marked as a potential duplicate. It's not, as I'm not necessarily trying to add a column - I'm just trying to prevent the description to be multiplied by the number of item_num that are attributed to a particular id.

UPDATE, 6/21:

How could I do the merge, if the 2 DFs looked like this instead?

id   | item_num | other_col
A123 |     1    | lorem ipsum
A123 |     2    | dolor sit
A123 |     3    | amet, consectetur
B456 |     1    | lorem ipsum

And I have example CSV 'B':

id   | item_num | description
A123 |     1    | Mary had a...
A123 |     2    | ...little lamb.
B456 |     1    | ...Its fleece...

So I end up with:

id   | item_num |  other_col  | description
A123 |     1    | lorem ipsum | Mary Had a...
A123 |     2    | dolor sit   | ...little lamb.
B456 |     1    | lorem ipsum | ...Its fleece...

Meaning, the row that has the 3, with "amet, consectetur" in the "other_col" is ignored.

Possible duplicate of [Adding new column to existing DataFrame in Python pandas](http://stackoverflow.com/questions/12555323/adding-new-column-to-existing-dataframe-in-python-pandas) — TemporalWolf, May 03 '17 at 20:18
It looks like you want to [`concat` or `append`](http://pandas.pydata.org/pandas-docs/stable/merging.html), not `merge`. — TemporalWolf, May 03 '17 at 20:21

score 1 · Answer 1 · answered May 03 '17 at 19:46

1

try indexing your df and then drop duplicates:

df = df.set_index(['id', 'item_num']).drop_duplicates()

answered May 03 '17 at 19:46

Steven G

16,244
8
53
77

So I tried that, and it seems to have removed both columns and all data... but that did solve the issue of duplication as the remaining data was not duplicated like before. – kabaname May 03 '17 at 20:13

score 1 · Answer 2 · answered May 03 '17 at 20:10

1

I think you need concat

result = pd.concat([df1.set_index('id'), df2.set_index('id')],axis = 1).reset_index()

You get

    id      item_no     description
0   A123    1           Mary had a...
1   A123    2           ...little lamb
2   B456    1           ...Its fleece...

answered May 03 '17 at 20:10

Vaishali

37,545
5
58
86

I'm getting a `ValueError: Shape of passed values is (13, 10799), indices imply (13, 6240)` – kabaname May 03 '17 at 21:52

MaxU - stand with Ukraine · Accepted Answer · 2017-05-03T22:00:50.480

1

I'd do it this way:

In [135]: result = A.merge(B.assign(item_num=B.groupby('id').cumcount()+1))

In [136]: result
Out[136]:
     id  item_num       description
0  A123         1     Mary had a...
1  A123         2   ...little lamb.
2  B456         1  ...Its fleece...

Explanation: we can create "virtual" item_num column in the B DF for joining:

In [137]: B.assign(item_num=B.groupby('id').cumcount()+1)
Out[137]:
     id       description  item_num
0  A123     Mary had a...         1
1  A123   ...little lamb.         2
2  B456  ...Its fleece...         1

edited May 03 '17 at 22:00

answered May 03 '17 at 20:26

MaxU - stand with Ukraine

205,989
36
386
419

I wish this had worked for me, but it seems to have just not included any of the data from one of the CSVs. In fact, the resulting CSV is just a copy of one of the CSVs. – kabaname May 03 '17 at 21:59
1

@kabaname, are you sure you have assigned the result of merging back? – MaxU - stand with Ukraine May 03 '17 at 22:03
nevermind, so I got it to produce the result - but it's still multiplying the rows so as to repeat the descriptions for `1` and `2` just like in my example. In other words, `Mary had a...` is repeating for both 1 and 2 and then `...little lamb.` is also repeating, still. @maxu – kabaname May 04 '17 at 02:59
Just thought I'd circle back and let you know this works fine! Thanks for your help. – kabaname Jun 04 '17 at 01:04
Hi @MaxU, I have an update to this problem and was wondering if you could shed some insight? – kabaname Jun 21 '17 at 17:09

Prevent duplication of rows when performing merge

3 Answers3