
I have read a wikipedia table into a dataframe:

https://es.wikipedia.org/wiki/Anexo:Municipios_de_la_Comunidad_de_Madrid

import pandas as pd
from unicodedata import normalize

df = pd.read_html('https://es.wikipedia.org/wiki/Anexo:Municipios_de_la_Comunidad_de_Madrid')
madrid = df[0] 
madrid['Población(2017)'] = madrid['Población(2017)'].apply(lambda x:normalize('NFKD', x)).str.replace(' ','')
madrid['Población(2017)'] = pd.to_numeric(madrid['Población(2017)'])

I had to use `unicodedata.normalize` because the apparent spaces used to format numbers such as 206 589 were actually `\xa0` (non-breaking space) characters.
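The normalization step can be checked in isolation: under NFKD, a non-breaking space (`\xa0`) decomposes to a plain space, after which the digits can be joined and parsed:

```python
from unicodedata import normalize

raw = "206\xa0589"                  # number formatted with a non-breaking space
clean = normalize('NFKD', raw)      # the \xa0 becomes a regular space: "206 589"
print(int(clean.replace(' ', '')))  # 206589
```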

Now, I want to select from that dataframe a subset of cities whose populations add up to a total as close as possible to a given number. Specifically, I would like to select the cities whose populations, added together, come out just above 2,200,000 inhabitants.

I tried variations of this, without success:

madrid[madrid['Población(2017)'].sum() > 2178000]

The error message:

KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2645             try:
-> 2646                 return self._engine.get_loc(key)
   2647             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: True

Could somebody figure out a condition that selects what I want?

Thanks in advance

2 Answers

  1. Exclude Madrid, as it is > 2.2M in its own right
  2. Sort by population descending (to get the minimum set), then calculate `cumsum()`
  3. The municipios to keep are then those where the cumulative total is < 2.2M
  4. Filter the original dataframe by the indexes found in step 3

To generate other options, exclude different municipios or use a different sort order.
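The sort + `cumsum` + filter idea in the steps above can be sketched on toy data (hypothetical names and numbers) before applying it to the real table:

```python
import pandas as pd

# hypothetical toy populations
toy = pd.DataFrame({
    "Nombre": ["A", "B", "C", "D"],
    "Población(2017)": [500, 300, 200, 100],
})
ordered = toy.sort_values("Población(2017)", ascending=False)
# running total down the sorted column: 500, 800, 1000, 1100
keep = ordered[ordered["Población(2017)"].cumsum() < 900]
print(keep["Nombre"].tolist())  # ['A', 'B'] -- adding C would pass the cap
```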

import pandas as pd  # lxml must also be installed for read_html()

df = pd.read_html('https://es.wikipedia.org/wiki/Anexo:Municipios_de_la_Comunidad_de_Madrid')
dfm = df[0]
# the thousands separator is a non-breaking space (\xa0)
dfm["Población(2017)"] = pd.to_numeric(dfm["Población(2017)"].str.replace("\xa0", "", regex=False))
exc = ["Madrid", "Alcalá de Henares", "Móstoles"]
# exclude, sort descending, then take the running total of the population column
dfl = dfm[~dfm["Nombre"].isin(exc)].sort_values("Población(2017)", ascending=False)
dfl["cum"] = dfl["Población(2017)"].cumsum()
# keep only the rows whose cumulative total stays below 2.2M
dfm[dfm.index.isin(dfl[dfl["cum"] < 2200000].index)]

output

Nombre  Población(2017) Superficie (km²)[1]​    Mapa    Escudo  Capitalidad[1]​ Altitud(msnm)[a]​[2]​
4   Alcalá de Henares   194310  8772    NaN NaN Alcalá de Henares   587
5   Alcobendas  114864  4498    NaN NaN Alcobendas  669
6   Alcorcón    168141  3373    NaN NaN Alcorcón    711
12  Aranjuez    58213   20111   NaN NaN Aranjuez    495
44  Collado Villalba    62152   2652    NaN NaN Collado Villalba    918
50  Coslada 83011   1201    NaN NaN Coslada 620
57  Fuenlabrada 194669  3941    NaN NaN Fuenlabrada 662
64  Getafe  178288  7838    NaN NaN Getafe  622
73  Leganés 187720  4309    NaN NaN Leganés 667
79  Majadahonda 71299   3847    NaN NaN Majadahonda 740
90  Móstoles    206589  4536    NaN NaN Móstoles    660
101 Parla   125898  2451    NaN NaN Parla   646
110 Pozuelo de Alarcón  85605   4320    NaN NaN Pozuelo de Alarcón  689
119 Rivas-Vaciamadrid   83767   6738    NaN NaN Vaciamadrid 563
123 Las Rozas de Madrid 95071   5831    NaN NaN Las Rozas de Madrid 712
130 San Sebastián de los Reyes  86707   5866    NaN NaN San Sebastián de los Reyes  672
142 Torrejón de Ardoz   128013  841 NaN NaN Torrejón de Ardoz   586
156 Valdemoro   73976   6417    NaN NaN Valdemoro   613

Rob Raymond
  • I think the solution is still wrong, because this is only one combination of values with `cumsum`. You need to try all possible combinations, build a cumsum with a filter, and finally pick the best combination of values. – jezrael Jul 09 '20 at 14:06
  • @jezrael C(178, 20) ~= 1.5*10^25. This is theoretical, but the number of combinations would be very big. Human decision-making is required to choose which combinations are worth trying. I've further updated the solution to make exclusions simpler – Rob Raymond Jul 09 '20 at 14:29
  • I have presented a solution without trying any combinations. Can you please test it? @RobRaymond – shahid hamdam Jul 09 '20 at 14:40
  • @shahidhamdam Honestly, I don't see how to use your solution. Nowhere does it get the data from Wikipedia, and the output you have provided is a list of all index values (there are 178 municipios excluding Madrid) – Rob Raymond Jul 09 '20 at 15:10
  • @RobRaymond Please check it now – shahid hamdam Jul 09 '20 at 15:30
  • No, actually there are 172 rows in the final dataframe, and their sum is closest to n. – shahid hamdam Jul 09 '20 at 15:32
  • I didn't ask the question, provided an answer, and just have a Jupyter notebook handy ;-) `new_df[0]["Población(2017)"] = pd.to_numeric(new_df[0]["Población(2017)"].str.replace("[\xa0]","", regex=True))` then `final["Población(2017)"].sum()` looks good **2194486**. What to optimise on: number of records, or nearest to the target? Guess that's down to the poster to decide :-) – Rob Raymond Jul 09 '20 at 15:38
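jezrael's point about trying combinations can be sketched with `itertools` on a toy candidate set (hypothetical populations; brute force only scales to a few dozen candidates, not all 178 municipios):

```python
from itertools import combinations

# hypothetical populations for a handful of candidate municipios
pops = {"A": 900000, "B": 700000, "C": 400000, "D": 250000, "E": 120000}
target = 2200000

best = None
for r in range(1, len(pops) + 1):
    for combo in combinations(pops, r):
        total = sum(pops[m] for m in combo)
        # keep the combination that lands just above the target
        if total >= target and (best is None or total < best[0]):
            best = (total, combo)

print(best)  # (2250000, ('A', 'B', 'C', 'D'))
```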

This could be a possible solution without trying any combination. Please test it and let me know.

import pandas as pd

target = 2178000          # the total we want to approach
new_val = target          # amount still to be covered
col = 'Población(2017)'
indexx = []               # index labels of the selected rows
out_val = 0               # running total of the selected populations

df = pd.read_html('https://es.wikipedia.org/wiki/Anexo:Municipios_de_la_Comunidad_de_Madrid')
x = df[0]
x[col] = pd.to_numeric(x[col].str.replace("\xa0", "", regex=False))

while target > out_val:
    # distance of every remaining population to the amount still needed
    x['diff_to_n'] = x[col] - new_val
    # rows matching the minimum difference (NaN everywhere else)
    s = x.where(x['diff_to_n'] == x['diff_to_n'].min())
    ind = s[col].idxmax()
    x.drop(ind, inplace=True)
    indexx.append(ind)
    out_val += s[col].max()
    new_val = new_val - s[col].max()

# re-read to recover the untouched rows, then select by index label
new_df = pd.read_html('https://es.wikipedia.org/wiki/Anexo:Municipios_de_la_Comunidad_de_Madrid')
final = new_df[0].loc[indexx]
final

Here `out_val` is the sum closest to your given value, `x` is your dataframe, and `target` and `new_val` hold your given value. `col` is the column you want to use. I know this is not an elegant solution, but it is still a solution. `final` is the resulting DataFrame.

This is the list of indexes I got for your n = 2178000.
The value I got which is closest to n is 2194486.0.
shahid hamdam