How to find the Pareto-optimal solutions in a pandas DataFrame

Question

I have a pandas dataframe with the name df_merged_population_current_iteration whose data you can download here as a csv file: https://easyupload.io/bdqso4

Now I want to create a new dataframe called pareto_df that contains all Pareto-optimal solutions from df_merged_population_current_iteration with regard to the minimization of the 2 objectives "Costs" and "Peak Load". Further, no duplicates should be stored: if several solutions have identical values for both objectives "Costs" and "Peak Load", only one of them should be saved. Additionally, there is a check whether the value for "Thermal Discomfort" is smaller than 2. If this is not the case, the solution is not included in pareto_df.

For this purpose, I came up with the following code:

import pandas as pd

df_merged_population_current_iteration = pd.read_csv("C:/Users/wi9632/Desktop/sample_input.csv", sep=";")

# create a new DataFrame to store the Pareto-optimal solutions
pareto_df = pd.DataFrame(columns=df_merged_population_current_iteration.columns)

for i, row in df_merged_population_current_iteration.iterrows():
    is_dominated = False
    is_duplicate = False
    for j, other_row in df_merged_population_current_iteration.iterrows():
        if i == j:
            continue
        # Check if the other solution dominates the current solution
        if (other_row['Costs'] < row['Costs'] and other_row['Peak Load'] < row['Peak Load']) or \
                (other_row['Costs'] <= row['Costs'] and other_row['Peak Load'] < row['Peak Load']) or \
                (other_row['Costs'] < row['Costs'] and other_row['Peak Load'] <= row['Peak Load']):
            # The other solution dominates the current solution
            is_dominated = True
            break
        # Check if the other solution is a duplicate
        if (other_row['Costs'] == row['Costs'] and other_row['Peak Load'] == row['Peak Load']):
            is_duplicate = True
            break

    if not is_dominated and not is_duplicate and row['Thermal Discomfort'] < 2:
        # The current solution is Pareto-optimal, not a duplicate, and meets the discomfort threshold
        row_df = pd.DataFrame([row])
        pareto_df = pd.concat([pareto_df, row_df], ignore_index=True)

print(pareto_df)

In most cases, the code works fine. However, there are cases in which no Pareto-optimal solution is added to the new dataframe pareto_df, although Pareto-optimal solutions that fulfill the criteria exist. This can be seen with the data I posted above. The solutions with the "id of the run" 7 and 8 are Pareto-optimal (and fulfill the thermal discomfort constraint). However, the current code does not add either of them to the new dataframe. It should add one of them (but not both, as the second would be a duplicate). I have already tried a lot and had a closer look at the code, but I could not find the mistake.

Here is the actual output with the uploaded data:

Empty DataFrame
Columns: [Unnamed: 0, id of the run, Costs, Peak Load, Thermal Discomfort, Combined Score]
Index: []

The desired output is a dataframe with exactly one row: one of the two Pareto-optimal solutions with "id of the run" 7 or 8.

Do you see what the mistake might be and how I have to adjust the code such that it in fact finds all pareto-optimal solutions without adding duplicates?

Reminder: Does anyone have any idea why the code does not find all pareto-optimal solutions? I'll highly appreciate any comments.

Refrain from showing your dataframe as an image. Your question needs a minimal reproducible example consisting of sample input, expected output, actual output, and only the relevant code necessary to reproduce the problem. See [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) for best practices related to Pandas questions. — itprorh66, Jun 06 '23 at 19:42
@itprorh66: Thanks for your comment and good advice. I now updated the code and have a minimal reproducible example with sample input data, actual output and desired output as you suggested. — PeterBe, Jun 07 '23 at 09:46
@itprorh66: Thanks for your comment. I adjusted the question according to your comments. Would you mind having a look at it? I'll highly appreciate every further comment from you. — PeterBe, Jun 09 '23 at 07:39
I’m voting to close this question because Open ended and opinion based questions that boil down to subjective responses are generally not a good fit for this site, since there generally is not a single correct answer but a range of opinions based on different approaches. — itprorh66, Jun 09 '23 at 14:14
@itprorh66 This is definitely not an opinion based question. I provided all the things you asked for and made it clear what the output should be. I also have sample input data. Your statement "because Open ended and opinion based questions that boil down to subjective responses" is completely wrong. Also really rude from you to vote to close this question after I have provided everything you asked for. If you don't want to answer this clear question that is fine. But voting to close such that other can't answer it, is quite mean (especially considering your wrong justifications) — PeterBe, Jun 11 '23 at 12:31
You are indeed asking for opinions as indicated by your statement ":Does anyone have any idea why the code does not find all pareto-optimal solutions? I'll highly appreciate any comments." — itprorh66, Jun 14 '23 at 19:07
@itprorh66: I just always write something like this if anyone has an idea as to why something is wrong. From the input and output data and my explanations you can clearly see, that the code can't be correct as the desired output differs from the real output. So your "concern" is just a matter of wording. — PeterBe, Jun 15 '23 at 09:30

score 0 · Answer 1 · answered Aug 04 '23 at 18:01

The condition for testing dominance should be written more strictly. The culprit seems to be the last if clause, where you check both non-dominance and duplication at the same time.

Your old code has a bug: it adds a row to the output DataFrame (pareto_df) only when the row is simultaneously non-dominated and not a duplicate. This fails as soon as the input DataFrame contains duplicate rows. If two rows are duplicates, neither dominates the other, so one of them should be added. In the old code each of the two rows flags the other as a duplicate, so both are skipped, hence the empty DataFrame.

Remember that a point should be added to the Pareto DataFrame whenever it remains undominated. Duplicates in the output are then handled with drop_duplicates.
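The effect can be reproduced on a toy frame. The sketch below uses made-up values (not the asker's CSV) and a hypothetical helper, kept_by_old_logic, that mirrors the question's loop without the discomfort check. Rows 0 and 1 are identical and undominated, yet neither is kept.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 tie in both objectives, row 2 is dominated.
df = pd.DataFrame({
    "Costs": [10.0, 10.0, 12.0],
    "Peak Load": [5.0, 5.0, 6.0],
})

def kept_by_old_logic(frame):
    """Return the index labels that the question's loop would keep."""
    kept = []
    for i, row in frame.iterrows():
        is_dominated = False
        is_duplicate = False
        for j, other in frame.iterrows():
            if i == j:
                continue
            if (other["Costs"] <= row["Costs"] and other["Peak Load"] < row["Peak Load"]) or \
                    (other["Costs"] < row["Costs"] and other["Peak Load"] <= row["Peak Load"]):
                is_dominated = True
                break
            if other["Costs"] == row["Costs"] and other["Peak Load"] == row["Peak Load"]:
                # Each of the two identical rows sees the other one here.
                is_duplicate = True
                break
        if not is_dominated and not is_duplicate:
            kept.append(i)
    return kept

kept = kept_by_old_logic(df)
print(kept)
```

Row 0 is rejected because row 1 is its duplicate, row 1 is rejected because row 0 is its duplicate, and row 2 is dominated, so the list comes back empty.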

import pandas as pd

df_merged_population_current_iteration = pd.read_csv("C:/Users/wi9632/Desktop/sample_input.csv", sep=";")

# create a new DataFrame to store the Pareto-optimal solutions
pareto_df = pd.DataFrame(columns=df_merged_population_current_iteration.columns)

for i, row in df_merged_population_current_iteration.iterrows():
    is_dominated = False
    for j, other_row in df_merged_population_current_iteration.iterrows():
        if i == j:
            continue
        # Check if the other solution dominates the current solution (strictly better in both objectives)
        if (other_row['Costs'] < row['Costs'] and other_row['Peak Load'] < row['Peak Load']):
            # The other solution dominates the current solution and hence row cannot be added to pareto set.
            is_dominated = True
            break
        # No early exit on duplicates: a later row might still dominate this one.

    if not is_dominated and row['Thermal Discomfort'] < 2:
        # The current solution is Pareto-optimal, and meets the discomfort threshold
        row_df = pd.DataFrame([row])
        pareto_df = pd.concat([pareto_df, row_df], ignore_index=True)

# Keep only one solution per (Costs, Peak Load) pair; other columns such as the run id may differ
pareto_df = pareto_df.drop_duplicates(subset=['Costs', 'Peak Load']).reset_index(drop=True)

print(pareto_df)
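
For larger populations the nested iterrows loops get slow. Below is a minimal vectorized sketch of the same idea, not a tested drop-in replacement. The function name pareto_front, its parameters, and the sample values are made up for illustration. It assumes the usual definition of dominance (no worse in every objective and strictly better in at least one), checks dominance against all rows before applying the discomfort threshold as the loop does, and deduplicates on the two objective columns only.

```python
import pandas as pd

def pareto_front(df, objectives=("Costs", "Peak Load"),
                 constraint_col="Thermal Discomfort", limit=2.0):
    """Non-dominated rows (minimization) that meet the constraint, one per objective pair."""
    values = df[list(objectives)].to_numpy(dtype=float)
    # no_worse[i, j]: row j is <= row i in every objective
    no_worse = (values[None, :, :] <= values[:, None, :]).all(axis=2)
    # better[i, j]: row j is < row i in at least one objective
    better = (values[None, :, :] < values[:, None, :]).any(axis=2)
    # Row i is dominated if some row j satisfies both conditions
    dominated = (no_worse & better).any(axis=1)
    feasible = (df[constraint_col] < limit).to_numpy()
    result = df[~dominated & feasible]
    return result.drop_duplicates(subset=list(objectives)).reset_index(drop=True)

# Hypothetical sample data: runs 7 and 8 tie in both objectives, run 9 violates the threshold.
sample = pd.DataFrame({
    "id of the run": [1, 2, 7, 8, 9],
    "Costs": [20.0, 15.0, 10.0, 10.0, 9.0],
    "Peak Load": [8.0, 9.0, 5.0, 5.0, 7.0],
    "Thermal Discomfort": [0.5, 0.5, 1.0, 1.0, 3.0],
})

front = pareto_front(sample)
print(front)
```

On this sample only run 7 remains: runs 1 and 2 are dominated, run 8 is dropped as a duplicate of run 7, and run 9 fails the discomfort check. The pairwise comparison builds n x n boolean arrays, so memory grows quadratically with the number of rows.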
