Sorting algorithm based on two conditions

Question

Problem

I am working on a Python problem, where a car dealer wants to accumulate a list of vehicles where the total mileage across the chosen vehicles is a maximum (constraint 1, I don't know why he would want the highest mileage but it is what it is) and he has to remain under a certain budget (constraint 2, $300000).

Question

I know how to sort data based on 1 condition, but sorting it based on 2 values is trickier than I thought. What is the best way to achieve my problem? Please see my attempt below.

Small sample of Data

--------------------------------------------------
| Licence | Manufacturer | Price         | Mileage
--------------------------------------------------
|   1     | Audi         |     42000     | 8000
--------------------------------------------------
|   2     | Mercedes     |     33000     | 15000
--------------------------------------------------
|   3     | Lexus        |     38000     | 10000
--------------------------------------------------
|   4     | BMW          |     25000     | 20000
--------------------------------------------------
|   5     | Mercedes     |     55000     | 33000
--------------------------------------------------

My Attempt

I first thought of assigning a sort of weight between the mileage and price, because a car could have high mileage but its price could also be very high, so I thought sorting based just on mileage is incorrect. For example, suppose I have three cars, A, B and C. Car A has 10000 miles and costs $20000. Car B has 20000 miles but costs $40000. In this case, choosing either wouldn't make a difference. But suppose car C has 25000 miles, but it costs $80000. The algorithm should first consider Car A and B before considering C, even though C has the most mileage, it is not 'worth' the price.

So I created a new column that is the ratio between the mileage and price and sorted this list using that ratio as the key and then reversed it to get the ratios starting from the highest value. I then looped through this list, appending the cars to a new list if the total amount didn't exceed the budget.

cost = 0;

with open(fileName, 'r') as inputFile:
    list1 = csv.reader(inputFile, delimiter=' ')
    list2 = [(row[0], row[1], row[2], row[3],  float(row[3])/float(row[2])) for l in list1]
    list2.sort(key = lambda x: x[4])
    list2.reverse()


cars2Buy = []
for l in list2:
    if (cost + int(row[2])) <= 300000:
       cost += int(row[2])
       cars2Buy.append((row[0], row[1], row[2], row[3]))

    else: break

However, I also tried a different dataset and sorted based just on the mileage, for example:

 list2.sort(key = lambda x: x[3]),

instead of

list2.sort(key = lambda x: x[4])

and surprisingly in that particular dataset sorting based just on mileage gave me a list of cars that had more mileage than my 'weighting' algorithm and was still under budget! This must mean my way of solving this problem is flawed, but I can't see why. I would greatly appreciate any suggestions! Thanks.

Use the pandas library - it will make your life a lot easier. https://stackoverflow.com/questions/17141558/how-to-sort-a-dataframe-in-python-pandas-by-two-or-more-columns — moebius, Sep 09 '18 at 12:22
hi @moebius, yes I know of the pandas library and how to use it. In my particular example, I personally didn't feel the need/want to use it. — user3451660, Sep 09 '18 at 14:29

pietroppeter · Answer 1 · 2018-09-09T14:34:47.617

The problem you are describing it seems to me it is a case knapsack problem: you have set of items (list of cars), each with a value (mileage) and a weight (price) and knapsack (selection of cars) with a limited capacity in terms of weight (total budget). You want to have a selection of cars that maximizes the value of cars, while keeping the total weight below the capacity.

You should know that this is a difficult problem (NP-hard) and depending on the size of your data, it might take too long to find the optimal solution. So you often have to recur to an approximate solution.

The algorithm you are describing (ordering by value/weight ratio and selecting top items until knapsack is full) is the greedy algorithm which gives an approximate solution that it is not guaranteed to be optimal. So I would assume that in your case the greedy algorithm it is not finding the optimal solution (while a better solution can be found by selecting top items by value).

A simple case where this happens is the following: assume you have a budget of 10K and a list of 2 cars. One has mileage of 9K and price of 10K, the other with mileage and price both equal to 2K. The second car has better ratio mileage/price (1 instead of 0,9), but if you just select the car with most mileage you have a better solution (in this case it clearly is the optimal solution).

Update

To find implementations that gives you the optimal solution you should google around "knapsack solver python" or something similar. You should find stuff like this (using OR-Tools by Google) or that (using PuLP or other libraries).

hmm very interesting! I didn't think about the scenario you mentioned in that last case, and I see now how it is flawed! Maybe, for now, I could combine both methods, first determining the car list based just on mileage, and thereafter making another list of cars using my 'weighted' method. And then compare the two results, and the one with the greatest total mileage would be the one I choose. What do you think about this temporary option? — user3451660, Sep 09 '18 at 14:20
well, the temporary solution is for sure better than the single 'weighted' method (aka greedy algorithm for knapsack problem), it is a sort of ensemble of algorithms in which you choose the one which performs best. In your case I would be curious to try one of the libraries I mentioned in the update to find the optimal solution (although your dataset might be too big)... — pietroppeter, Sep 09 '18 at 14:31

score 1 · Accepted Answer · answered Sep 10 '18 at 01:54

I agree with Petro, this seems to be a 0/1 knapsack problem, with N=number of cars and W=max price of 300000 (and value=mileage of car).

And yeah, Knapsack is NP-Hard so it doesn't have a polynomial algorithm. However, there is a fairly fast algorithm that runs in O(NW), which is good for a few thousand cars in our case.

We can adapt the 0/1 knapsack algorithm from the wikipedia page so that it uses 2N memory instead of NW to save time allocating memory. As for how the knapsack algorithm works:

m[i][j] is the maximum mileage of the first i cars with the total price of at most j.
since, in every selection of cars, the ith car is either in the selection or not, we can calculate m[i][j] by considering the best value of two cases:
1. the ith car is in the optimal selection. The optimal value for this is the selection of i-1 cars with a total price at most j-cost[i] (i.e. m[i-1][j-cost[i]]
2. the ith car is not in the optimal selection. The optimal value is then the best value for selecting i-1 cars with price at most j (i.e. m[i-1][j])

Code:

#!/usr/bin/env python
import csv

def solve():
    fileName = 'hi.in'
    with open(fileName, 'r') as inputFile:
        list1 = csv.reader(inputFile, delimiter=',')
        list2 = [(int(row[0]), row[1], int(row[2]), int(row[3])) for row in list1]

    cars2Buy = get_cars(list2)
    print(cars2Buy)


def get_cars(l):
    N = len(l)
    W = 300000
    l = [0]+l # make list one-based

    # 2N array
    m = [[0 for _ in range(W+1)] for _ in range(2)]

    w = 2 # weight is the price
    v = 3 # value is the mileage
    cars2Buy = []

    for i in range(1, N+1):
        # space-optimisation, move one to zero
        for j in range(0, W+1):
            m[0][j] = m[1][j]

        for j in range(0, W+1):
            if j-l[i][w] < 0:
                m[1][j] = m[0][j]
            else:
                m[1][j] = max(m[0][j], m[0][j-l[i][w]] + l[i][v])

        # if the optimal value for [1,i] is larger than [1,i-1],
        # then the car must be in the selection
        if m[1][W] > m[0][W]:
            cars2Buy.append(l[i])

    return cars2Buy

def main():
    solve()

main()

Hi D Slee. Thank you very much for taking the time to answer. Since you put in effort by adapting the code to fit my problem, I will accept it as the answer. I will actually end up using the code found [here](https://rosettacode.org/wiki/Knapsack_problem/0-1#Dynamic_programming_solution) because your code for some reason didn't work properly (I think I messed up the indices when trying to adapt it to my actual data set which is a bit different from the sample I provided. ) Of course, that code is virtually the same. Anyway, thanks again! I really appreciate it ! — user3451660, Sep 10 '18 at 16:20

score 0 · Answer 3 · answered Sep 09 '18 at 13:47

0

Try this with pandas it would be much easier, see below example:

import pandas as pd

df = pd.read_csv("filename.csv", lineterminator='\r')  #read csv file into dataframe

df.sort_values('Mileage', ascending=False, inplace=True)  #sort Mileage column greater to smaller

df = df.loc[df['Price'] > 350000]   #filter price column based on condition

print(df)    #print the dataframe

print(df['Manufacturer'])    #you can print a specific column

answered Sep 09 '18 at 13:47

Chandella07

2,089
14
22

Thanks for answering. Yes, I am aware of the pandas library and have used it before, although your answer doesn't quite solve my problem. The budget amount of $300000 is the total amount he has to buy cars. He could buy 10 cars, as long as the total price doesn't exceed his budget of $300000. Of these 10 cars or whatever number of cars the code works out, those cars must have the highest accumulated mileage (the total mileage between them) compared to any other combination. – user3451660 Sep 09 '18 at 14:14

Sorting algorithm based on two conditions

3 Answers3