1

I need to generate a list of populations for different years for each country. The information I need is contained in two dataframes

  1. The first dataframe, gni_per_capita, contains names of countries and years. Countries in this dataframe have a different range of years

  2. The second dataframe, hihd also has names of countries and dates, but the list of countires is more extensive, and there is a wider range of dates for each country. The second dataframe contains the population of each country in each year, the second does not.

I need to generate a list of population for every year of each of the countries in the first dataframe.

I was given the following tip:

1. first, get a unique list of countries from gni_per_capita. 
2. Loop through the list, and get the available years for that country. 
3. Then .loc index hihd to get the population rows where both the country
    and years are correct (hihd.Year.isin(?)). 
4. Append these to the list
    one by one.

Thus far, I have created a series with country and year from the first dataframe

group = gni_per_capita.groupby('Entity')

ync = group.apply(lambda x: x['Year'].unique())

However, I am struggling to combine the second dataframe with the created series

mask = hihd.Year.isin(ync)
Hamster
  • 99
  • 1
  • 3
  • 7
  • Please provide a [minimal reproducible example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). See also: https://stackoverflow.com/help/minimal-reproducible-example – recentadvances Nov 08 '20 at 01:49
  • Here's good examples of how to help people answering your question: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – Sander van den Oord Nov 08 '20 at 07:51

1 Answers1

0

Why don't you use merge?

With the two sample frames

gni_per_capita = pd.DataFrame(
    {'Entity': 2 * ['A'] + 3 * ['B'],
     'Year': list(range(2020, 2020 + 2)) + list(range(2020, 2020 + 3))}
)
hihd = pd.DataFrame(
    {'Entity': 4 * ['A'] + 4 * ['B'] + 4 * ['C'],
     'Year': 3 * list(range(2019, 2023)),
     'Population': list(range(10, 14)) + list(range(20, 24)) + list(range(30, 34))}
)
  Entity  Year
0      A  2020
1      A  2021
2      B  2020
3      B  2021
4      B  2022

   Entity  Year  Population
0       A  2019          10
1       A  2020          11
2       A  2021          12
3       A  2022          13
4       B  2019          20
5       B  2020          21
6       B  2021          22
7       B  2022          23
8       C  2019          30
9       C  2020          31
10      C  2021          32
11      C  2022          33

this

gni_per_capita.merge(hihd, how='left', on=['Entity', 'Year'])

gives you

  Entity  Year  Population
0      A  2020          11
1      A  2021          12
2      B  2020          21
3      B  2021          22
4      B  2022          23

which seems to be what you want. Or did I miss something?

If you want it as a list:

result = (gni_per_capita.merge(hihd, how='left', on=['Entity', 'Year'])
          .to_numpy().tolist())
[['A', 2020, 11], ['A', 2021, 12], ['B', 2020, 21], ['B', 2021, 22], ['B', 2022, 23]]
Timus
  • 10,974
  • 5
  • 14
  • 28