6

I have a dataframe of 10,000 rows that I am trying to sum all possible combinations of those rows. According to my math, that's about 50 million combinations. I'll give a small example to simplify what my data looks like:

df = Ratio     Count     Score
     1         6         11
     2         7         12
     3         8         13
     4         9         14
     5         10        15

And here's the desired result:

results = Min Ratio     Max Ratio     Total Count     Total Score
          1             2             13              23
          1             3             21              36
          1             4             30              50
          1             5             40              65
          2             3             15              25
          2             4             24              39
          2             5             34              54
          3             4             17              27
          3             5             27              42
          4             5             19              29

This is the code that I came up with to complete the calculation:

for i in range(len(df)):
    j = i + 1
    while j <= len(df):
        range_to_calc = df.iloc[i:j]
        total_count = range_to_calc['Count'].sum()
        total_score = range_to_calc['Score'].sum()
        new_row = {'Min Ratio': range_to_calc.at[range_to_calc.first_valid_index(),'Ratio'],
                   'Max Ratio': range_to_calc.at[range_to_calc.last_valid_index(),'Ratio'],
                   'Total Count': total_count,
                   'Total Score': total_score}
        results = results.append(new_row, ignore_index=True)
        j = j + 1

This code works, but according to my estimates after running it for a few minutes, it would take 200 hours to complete. I understand that using numpy would be a lot faster, but I can't wrap my head around how to build multiple arrays to add together. (I think it would be easy if I was doing just 1+2, 2+3, 3+4, etc., but it's a lot harder because I need 1+2, 1+2+3, 1+2+3+4, etc.) Is there a more efficient way to complete this calculation so it can run in a reasonable amount of time? Thank you!

P.S.: If you're wondering what I want to do with a 50 million-row dataframe, I don't actually need that in my final results. I'm ultimately looking to divide the Total Score of each row in the results by its Total Count to get a Total Score Per Total Count value, and then display the 1,000 highest Total Scores Per Total Count, along with each associated Min Ratio, Max Ratio, Total Count, and Total Score.

Jérôme Richard
  • 41,678
  • 6
  • 29
  • 59
mr7
  • 85
  • 5
  • 1
    I does not answer your question, but do you really need to pre-compute and store all combinations ? Could you not use something like `df[2:4].sum()` when needed ? – obchardon Apr 19 '21 at 12:09
  • I just edited my post to add a P.S. that I think addresses your question. I ultimately want to compare each row's value, and since I don't know which ratios will provide the highest Scores Per Count, I think the only way I can do that is to store all computed combinations, (or at least store the top 1000 and kick out the lowest as higher ones are computed.) – mr7 Apr 19 '21 at 12:14
  • So you are not considering the single row as a combination? – Amit Vikram Singh Apr 19 '21 at 12:20
  • You've created an N-combinatorics sized problem where N=10000, you will absolutely have to wait a week for that to finish. You can speed things up, extracting every rev out of your intel processor with python vectorization see https://www.youtube.com/watch?v=EEUXKG97YRw – Eric Leschinski Apr 19 '21 at 12:24
  • @mr7 let's say that you have `k` entries in your dataframe and no negative score, the biggest score will be `df[1:k].sum().Score`, then check if `df[1:k-1].sum().Score` or `df[2:k].sum().Score` is the second biggest possible sum and repeat the operation until you have found the 1000 biggest score. It will save you a week of computation. – obchardon Apr 19 '21 at 12:24
  • @Amit Vikram Singh I consider each single row of the results as an individual combination of all combinations of the original df, if that makes sense. – mr7 Apr 19 '21 at 12:24
  • This does not fully answer your question but [appending rows directly to a DataFrame is highly inefficient](https://stackoverflow.com/questions/27929472/improve-row-append-performance-on-pandas-dataframes). Consider putting all rows in a list and then convert it to a DataFrame with `DataFrame.from_dict()` instead of doing `results = results.append(new_row, ignore_index=True)`. – Gustave Coste Apr 19 '21 at 12:27
  • @Eric Leschinski, it does seem that I'm not maximizing my computer's processing power. I'll look into this to see if it helps. Thanks! – mr7 Apr 19 '21 at 12:29
  • @mr7 In your current code you should put `j < len(df)` and `range_to_calc = df.iloc[i:j+1]`, otherwise it doesn't match with your desired output. – Amit Vikram Singh Apr 19 '21 at 12:30
  • @mr7 Correct you are only using about 1% of your computer's full cpu power for this given task. If you want the explanation down to the bare metal for why python is going excruciatingly slow for this task you've given it, take 20 minutes and read this: https://stackoverflow.com/a/11227902/445131 – Eric Leschinski Apr 19 '21 at 12:32
  • @mr7 Is Min_ratio is `min` value or the `first` value in the combination? – Amit Vikram Singh Apr 19 '21 at 12:35
  • @mr7 Can you try running my answer and let me know the time it takes. – Amit Vikram Singh Apr 19 '21 at 12:47
  • @Amit Vikram Singh, Min_ratio would be the first value in the combination. So if I am calculating the sum of the first three rows in my example, I am summing the Counts and Scores of the Ratios 1, 2, & 3, thus the Min Ratio is 1 and Max Ratio is 3, but I'm summing all Ratios including and in between the Mix and Max Ratios. I am excited to try your answer, but I just started work, so I will have to wait about 10 hours to try it. :-( – mr7 Apr 19 '21 at 12:55
  • @Eric Leschinski I'll look into it! Thanks, again! – mr7 Apr 19 '21 at 12:57
  • @Gustave Coste, thanks for the suggestion. Dictionaries scare me, but I will try to learn how to use them. :-) – mr7 Apr 19 '21 at 13:00
  • Your code does not give the expected result on my machine. Which one is correct? The code or the result? With your code I got `1 1 6 11` for the first line. – Jérôme Richard Apr 19 '21 at 13:31
  • @Jérôme Richard, the results I show above were manually created by me. Yes, technically having a Min Ratio and Max Ratio of the same value (in this case, 1) would be a valid result. I didn't consider that when making my fabricated results table. I guess I assumed with 10,000 rows of data and at least 50 million results to sort through, that an equal Min Ratio and Max Ratio probably would not end up on my list of 1,000 highest Total Scores per Count. – mr7 Apr 19 '21 at 14:14
  • I really appreciate everyone's work and input! This is so cool! – mr7 Apr 20 '21 at 00:04

4 Answers4

3

After these improvements it takes ~2 minutes to run for 10k rows.

  1. For the sum computation, you can pre-compute cumulative sum(cumsum) and save it. sum(i to j) is equal to sum(0 to j) - sum(0 to i-1). Now sum(0 to j) is cumsum[j] and sum(0 to i - 1) is cumsum[i-1]. So sum(i to j) = cumsum[j] - cumsum[i - 1]. This gives significant improvement over computing sum each time for different combination.

  2. Operation over numpy array is faster than the operation on pandas series, hence convert every colum to numpy array and then do the computation over it.

  3. (From other answers): Instead of appending in list, initialise an empty numpy array of size ((n*(n+1)//2) -n , 4) and use it to save the results.

Use:

count_cumsum = np.cumsum(df.Count.values)
score_cumsum = np.cumsum(df.Score.values)
ratios = df.Ratio.values
n = len(df)
rowInCombination = (n * (n + 1) // 2) - n
arr = np.empty(shape = (rowInCombination, 4), dtype = int)
k = 0
for i in range(len(df)):
    for j in range(i + 1, len(df)):
        arr[k, :] = ([
              count_cumsum[j] - count_cumsum[i-1] if i > 0 else count_cumsum[j], 
              score_cumsum[j] - score_cumsum[i-1] if i > 0 else score_cumsum[j],
              ratios[i],
              ratios[j]])
        k = k + 1
out = pd.DataFrame(arr, columns = ['Total_Count', 'Total_Score', 
                    'Min_Ratio', 'Max_Ratio'])

Input:

df = pd.DataFrame({'Ratio': [1, 2, 3, 4, 5], 
                   'Count': [6, 7, 8, 9, 10],
                   'Score': [11, 12, 13, 14, 15]})

Output:

>>>out

  Min_Ratio Max_Ratio   Total_Count Total_Score
0   1     2              13                 23
1   1     3              21                 36
2   1     4              30                 50
3   1     5              40                 65
4   2     3              15                 25
5   2     4              24                 39
6   2     5              34                 54
7   3     4              17                 27
8   3     5              27                 42
9   4     5              19                 29
Amit Vikram Singh
  • 2,090
  • 10
  • 20
3

First of all, you can improve the algorithm. Then, you can speed up the computation using Numpy vectorization/broadcasting.

Here are the interesting point to improve the performance of the algorithm:

  • append of Pandas is slow because it recreate a new dataframe. You should never use it in a costly loop. Instead, you can append the lines to a Python list or even directly write the items in a pre-allocated Numpy vector.
  • computing partial sums take an O(n) time while you can pre-compute the cumulative sums and then just find the partial sum in constant time.
  • CPython loops are very slow, but the inner loop can be vectorized using Numpy thanks to broadcasting.

Here is the resulting code:

import numpy as np
import pandas as pd

def fastImpl(df):
    n = len(df)
    resRowCount = (n * (n+1)) // 2
    k = 0

    cumCounts = np.concatenate(([0], df['Count'].astype(int).cumsum()))
    cumScores = np.concatenate(([0], df['Score'].astype(int).cumsum()))
    ratios = df['Ratio'].astype(int)
    minRatio = np.empty(resRowCount, dtype=int)
    maxRatio = np.empty(resRowCount, dtype=int)
    count = np.empty(resRowCount, dtype=int)
    score = np.empty(resRowCount, dtype=int)

    for i in range(n):
        kStart, kEnd = k, k+(n-i)
        jStart, jEnd = i+1, n+1
        minRatio[kStart:kEnd] = ratios[i]
        maxRatio[kStart:kEnd] = ratios[i:n]
        count[kStart:kEnd] = cumCounts[jStart:jEnd] - cumCounts[i]
        score[kStart:kEnd] = cumScores[jStart:jEnd] - cumScores[i]
        k = kEnd
    assert k == resRowCount

    return pd.DataFrame({
        'Min Ratio': minRatio,
        'Max Ratio': maxRatio,
        'Total Count': count,
        'Total Score': score
    })

Note that this code give the same results than the code in your question, but the original code does not give the expected results stated in the question. Note also that since inputs are integers, I forced Numpy to use integers for sake of performance (despite the algorithm should work with floats too).

This code is hundreds of thousand times faster than the original code on big dataframes and it succeeds to compute a dataframe of 10,000 rows in 0.7 second.

Jérôme Richard
  • 41,678
  • 6
  • 29
  • 59
3

Others have explained why your algorithm was so slow so I will dive into that.

Let's take a different approach to your problem. In particular, look at how the Total Count and Total Score columns are calculated:

  • Calculate the cumulative sum for every row from 1 to n
  • Calculate the cumulative sum for every row from 2 to n
  • ...
  • Calculate the cumulative sum for every row from n to n

Since cumulative sums are accumulative, we only need to calculate it once for row 1 to row n:

  • The cumsum of (2 to n) is the cumsum of (1 to n) - (row 1)
  • The cumsum of (3 to n) is the cumsum of (2 to n) - (row 2)
  • And so on...

In other words, the current cumsum is the previous cumsum minus its first row, then dropping the first row.


As you have theorized, pandas is a lot slower than numpy so we will convert everthing into numpy for speed:

arr = df[['Ratio', 'Count', 'Score']].to_numpy() # Convert to numpy array

tmp = np.cumsum(arr[:, 1:3], axis=0)       # calculate cumsum for row 1 to n
tmp = np.insert(tmp, 0, arr[0, 0], axis=1) # create the Min Ratio column
tmp = np.insert(tmp, 1, arr[:, 0], axis=1) # create the Max Ratio column

results2 = [tmp]
for i in range(1, len(arr)):
    tmp = results2[-1][1:] # current cumsum is the previous cumsum without the first row
    diff = results2[-1][0] # the previous cumsum's first row

    tmp -= diff            # adjust the current cumsum
    tmp[:, 0] = arr[i, 0]  # new Min Ratio
    tmp[:, 1] = arr[i:, 0] # new Max Ratio
    results2.append(tmp)

# Assemble the result
results2 = np.concatenate(results2).reshape(-1,4)
results2 = pd.DataFrame(results2, columns=['Min Ratio', 'Max Ratio', 'Total Count', 'Total Score'])

During my test, this produces the results for a 10k row data frame in about 2 seconds.

Code Different
  • 90,614
  • 16
  • 144
  • 163
  • Thank you for the explanation! It's very smart thinking, something that never would have crossed my mind! – mr7 Apr 20 '21 at 00:04
0

Sorry to write late for this topic, but I'm just looking for a solution for a similar topic. The solution for this issue is simple because the combination is only in pairs. This is solved by uploading the dataframe to any DB and executing the following query whose duration is less than 10 seconds:

SEL f1.*,f2.*,f1.score+f2.score 
FROM table_with_data_source f1, table_with_data_source f2
where f1.ratio<>f2.ratio;

The database will do it very fast even if there are 100,000 records or more.

However, none of the algorithms that I saw in the answers, actually perform a combinatorial of values. He only does it in pairs. The problem really gets complicated when it's a true combinatorial, for example:

Given: a, b, c, d and e as records:

a
b
c
d
e

The real combination would be:

a+b
a+c
a+d
a+e
a+b+c
a+b+d
a+b+e
a+c+d
a+c+e
a+d+e
a+b+c+d
a+b+c+e
a+c+d+e
a+b+c+d+e
b+c
b+d
b+e
b+c+d
b+c+e
b+d+e
c+d
c+e
c+d+e
d+e

This is a real combinatorial, which covers all possible combinations. For this case I have not been able to find a suitable solution since it really affects the performance of any HW. Anyone have any idea how to perform a real combinatorial using python? At the database level it affects the general performance of the database.

WescoT
  • 37
  • 4