I have two dataframes in pandas, one of which, 'datapanel', has country data for multiple years, and the other, 'data', has country data for only one year, but also includes a "Regional indicator" column for each country. I simply want to create a new column in the datapanel frame that gives the 'Regional indicator' for each country. For some reason, the rows of the dataframe are just about doubling after this merge, whereas they should remain the same. What am I doing wrong?
-
You have duplicates in your dataframe, so the merge has produced a product of the matching rows. Drop the duplicates first, or use something like `.map`. – Umar.H Mar 26 '21 at 17:56
-
Does this answer your question? [Pandas Merging 101](https://stackoverflow.com/questions/53645882/pandas-merging-101) – Umar.H Mar 26 '21 at 18:04
-
I'm still confused - my original datapanel dataframe does not have duplicates. For example, those first two rows (Afghanistan 2008) only appear once in "datapanel" and not at all in "data". I don't understand why it's duplicating in this merge. – David Gallagher Mar 26 '21 at 19:36
1 Answer
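The key you are merging on (country name) is duplicated in 'datapanel' (see 'Afghanistan', which appears at least 5 times) and perhaps also in 'data'. A merge emits one row for every matching pair of keys, so a key repeated on both sides yields more rows than 'datapanel' had to begin with.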
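To make the effect concrete, here is a minimal sketch on made-up toy frames. The rows and values are illustrative assumptions, not the question's actual data. It lists the keys repeated in 'data', shows the row count growing, and uses the `validate` argument of `merge` so that pandas raises an error instead of silently multiplying rows.

```python
import pandas as pd

# Hypothetical toy frames standing in for the question's data
datapanel = pd.DataFrame({
    "Country name": ["Afghanistan", "Afghanistan", "Albania"],
    "year": [2008, 2009, 2008],
})
data = pd.DataFrame({
    "Country name": ["Afghanistan", "Afghanistan", "Albania"],
    "Regional indicator": ["South Asia", "South Asia", "Central and Eastern Europe"],
})

# Keys that occur more than once in the lookup frame
dup_keys = data.loc[data["Country name"].duplicated(), "Country name"].unique()

# Each of the 2 Afghanistan panel rows matches 2 lookup rows: 2*2 + 1 = 5 rows
merged = datapanel.merge(data, on="Country name", how="left")
n_rows = len(merged)

# validate="many_to_one" asserts that the keys are unique in the right frame
try:
    datapanel.merge(data, on="Country name", how="left", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
```

If `dup_keys` comes back empty for the real 'data' frame, the extra rows have another cause, for example merging on more columns than intended.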
Try using a different technique (v-lookup), something like this ("Country name" must be unique in 'data'):
    # Copy each country's regional indicator from 'data' into 'datapanel'
    for country in data["Country name"].values:
        indicator = data.loc[data["Country name"] == country, "Regional indicator"].item()
        datapanel.loc[datapanel["Country name"] == country, "Regional indicator"] = indicator
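The same lookup can probably be done without an explicit loop by using `.map`, as suggested in the comments on the question. The sketch below uses made-up toy frames (illustrative values, not the question's data) and assumes that keeping the first row per country in 'data' is acceptable when duplicates exist.

```python
import pandas as pd

# Hypothetical toy frames standing in for the question's data
datapanel = pd.DataFrame({
    "Country name": ["Afghanistan", "Afghanistan", "Albania", "Atlantis"],
    "year": [2008, 2009, 2008, 2008],
})
data = pd.DataFrame({
    "Country name": ["Afghanistan", "Albania"],
    "Regional indicator": ["South Asia", "Central and Eastern Europe"],
})

# Build a Series indexed by country name; drop_duplicates keeps the first
# row per country so the index is unique (an assumption about what is wanted)
lookup = (
    data.drop_duplicates("Country name")
        .set_index("Country name")["Regional indicator"]
)

# map() looks up each country in the Series; the row count never changes,
# and countries missing from 'data' get NaN
datapanel["Regional indicator"] = datapanel["Country name"].map(lookup)
```

Because `.map` only adds a column, 'datapanel' keeps its original number of rows whatever 'data' contains.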

Laurent
-
Indeed, there was an orphan bracket that needed to be removed, sorry. I updated my answer accordingly. – Laurent Mar 27 '21 at 16:18
-
Since Pandas is new to me, the use of .loc with a comma between the mask and the "Regional indicator" doesn't seem natural to me. I solved it as below: `for country in data["Country name"].values: indicator = data[data["Country name"] == country]['Regional indicator'].item() datapanel.loc[datapanel["Country name"] == country, "Regional indicator"] = indicator` Which isn't pretty, but I'm struggling to get code to format as code. Not my best day. :) – David Gallagher Mar 27 '21 at 16:47
-
That was also my way of writing pandas code for a long time, but I find it easier now with `loc`. And I learned that there's more in play than just style preferences, see this great post: https://stackoverflow.com/a/48411543/11246056 – Laurent Mar 27 '21 at 17:13