
I have a problem calculating the decay of a product.

The issue is that I have a very large number of combinations to run through in order to compute the decay. Each individual computation is simple, but the total runtime explodes because of the sheer number of combinations. The computation is a multi-nested loop over a multi-index DataFrame, and I thought parallelization might help - however, the parallel version I wrote was actually slower, so I clearly did something wrong.

To make matters worse (time-wise), I have only written the loop for one country so far, and I still need to extend it to the remaining countries...

The loop itself is relatively simple:

##############################################
#The actual computation
##############################################
import time 
start_time = time.time()
#Loop over gr, ip and year
for gr in np.arange(-2.0, 2.0, 0.5):  # -2.0, -1.5, ..., 1.5 (np.arange excludes the stop value)
    for l_s in np.arange(-10.0, 10.0, 1.0):
        for ip in [70, 80, 90]:
            for year in np.arange(1942.0, 2100.0, 1.0):  # Loop over years from 1942.0 to 2099.0
                subs = sales_df.loc[(gr, l_s, ip, year)]['Germany'].values                
                diagonal_values = subs
                result_array = np.zeros((len(diagonal_values) + len(decay) - 1, len(diagonal_values)))

                for i, val in enumerate(diagonal_values):
                    result_array[i:i+len(decay), i] = val * decay
                
                decay_total = result_array.sum(axis=1).round(0)
                decay_total_insert = decay_total[0:300]
                
                # Inserting the decayed values into the DataFrame
                subsetted_df.loc[(gr, l_s, ip, year), 'Germany'] = decay_total_insert
end_time = time.time()
execution_time = end_time - start_time
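
As far as I can tell, building result_array and summing over axis 1 is just a full 1-D convolution of the subset with the decay vector, so for a single subset the inner block can be collapsed into one NumPy call (a sketch only, using the same subs and decay as above):

# The shifted-diagonal matrix summed over its columns is the full convolution:
# decay_total[k] = sum_i subs[i] * decay[k - i]
decay_total = np.convolve(subs, decay).round(0)
decay_total_insert = decay_total[0:300]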

For test purposes I have generated some data that looks like the real data (which I cannot share). The generation code could probably also be optimized, but that doesn't matter - I have the real data.

import numpy as np
import pandas as pd
##############################################
#Creating sample data
##############################################
# Sample input data; could be generated more cleverly, but it doesn't matter since I have real data.
years = np.arange(1942, 2100)
gstep = np.array([-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2])
lstep = np.arange(-10, 11)
iprange = np.arange(70, 91, step=10)
countries = ["Germany", "Sweden", "Norway", "Austria", "Belgium", "France"]  # Add 17 more countries here

# Generate all possible combinations of input variables
inputcombs = np.array(np.meshgrid(years, gstep, lstep, iprange)).T.reshape(-1, 4)
input_df = pd.DataFrame(inputcombs, columns=["year", "gr", "lschange", "ip"])

# Generating sample sales data
np.random.seed(42)  # For reproducibility
sdata = np.random.randint(0, 100, size=(input_df.shape[0], len(countries)))
input_df[countries] = sdata

# Set the specified columns as the index
input_df.set_index(["gr", "lschange", "ip", "year"], inplace=True)

# Sorting index by specified columns
input_df.sort_index(level=["gr", "lschange", "ip", "year"], inplace=True)

# Create the additional index "actual_year" ranging from 1943 to 2100
actual_years = np.arange(1943, 2401)
index_levels = input_df.index.levels + [actual_years]
multi_index = pd.MultiIndex.from_product(index_levels, names=input_df.index.names + ["actual_year"])
output_df = pd.DataFrame(index=multi_index, columns=countries)

# Subset the DataFrame to rows where actual_year falls within the 300 years after the corresponding year.
subsetting_condition = (output_df.index.get_level_values('actual_year') >= (output_df.index.get_level_values('year') + 1)) & \
                      (output_df.index.get_level_values('actual_year') <= (output_df.index.get_level_values('year') + 300))
subsetted_df = output_df[subsetting_condition]

# Copy the subsetted DataFrame to hold the generated sales data
sales_df = subsetted_df.copy()
# Get the shape of the existing DataFrame
rows, cols = sales_df.shape

# Generate an array of random integers between 5 and 1000 with the same shape as the DataFrame
random_integers = np.random.randint(5, 1001, size=(rows, cols))

# Replace the DataFrame values with the random integers
sales_df.values[:] = random_integers

##############################################
#Creating decay vector
##############################################

# Parameters for the log-normal distribution
mean = 0.0  # Mean of the log-normal distribution
std_dev = 1.0  # Standard deviation of the log-normal distribution
size = 86  # Number of elements in the vector

# Generate log-normal values
log_normal_values = np.random.lognormal(mean, std_dev, size)

# Normalize the vector to sum up to 1
decay = log_normal_values / np.sum(log_normal_values)

I tried parallelizing with joblib (as in How do I parallelize a simple Python loop?), but it only made things slower, which others seem to have struggled with as well. I also tried vectorizing, but the parts I managed to vectorize did not speed things up (same runtime), so I went back to the loop, which is - if nothing else - relatively easy to follow.
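
For illustration, one possible fully vectorized formulation (a sketch only, not the exact code I timed; it assumes every (gr, lschange, ip, year) group is a contiguous block of exactly 300 sorted actual_year rows, which holds for the sample data above). Since decay is identical for every subset, the band matrix only needs to be built once and all convolutions collapse into a single matrix product:

# Sketch: convolve every 300-row block of the 'Germany' column with decay in one go.
n = 300                                   # rows per (gr, lschange, ip, year) group
vals = sales_df['Germany'].to_numpy(dtype='float64').reshape(-1, n)

# Band matrix D with D[i, i:i+len(decay)] = decay, built once instead of per iteration.
D = np.zeros((n, n + len(decay) - 1))
for i in range(n):
    D[i, i:i + len(decay)] = decay

conv = vals @ D                           # shape: (n_groups, n + len(decay) - 1)
subsetted_df['Germany'] = conv[:, :n].round(0).ravel()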

    IMO https://codereview.stackexchange.com/ is a much better place for this question. – Vitalizzare Aug 30 '23 at 13:39
  • Would you mind producing a [mcve]? The data is fine, but maybe you can use a smaller function so the process is easier to follow. – rpanai Aug 30 '23 at 19:45
  • It looks like you operate on an *object-typed* dataframe. This is very bad for performance, as pandas uses NumPy internally and NumPy cannot *vectorize* object-typed arrays. It is critical to operate on native types to get good performance; the precise type to use depends on the actual data in your dataframe. Besides, recreating `result_array` in a loop like this is inefficient. Please consider creating it once and filling it in the loop. – Jérôme Richard Aug 30 '23 at 22:10

1 Answer


Parallelizing nested loops can be challenging because of dependencies between iterations and the overhead of creating and managing threads or processes. In your case the iterations are independent, so each (gr, l_s, ip, year) combination can be handed to a pool of workers. Here's an example using the concurrent.futures module for multi-threading, which submits one task per combination:

import time
import numpy as np
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

# ... (Your data generation code)

# Function to process a single combination
def process_combination(gr, l_s, ip, year):
    subs = sales_df.loc[(gr, l_s, ip, year)]['Germany'].values                
    diagonal_values = subs
    result_array = np.zeros((len(diagonal_values) + len(decay) - 1, len(diagonal_values)))

    for i, val in enumerate(diagonal_values):
        result_array[i:i+len(decay), i] = val * decay

    decay_total = result_array.sum(axis=1).round(0)
    decay_total_insert = decay_total[0:300]

    return (gr, l_s, ip, year), decay_total_insert

# Parallel processing
start_time = time.time()
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(process_combination, gr, l_s, ip, year)
               for gr in np.arange(-2.0, 2.0, 0.5)
               for l_s in np.arange(-10.0, 10.0, 1.0)
               for ip in [70, 80, 90]
               for year in np.arange(1942.0, 2100.0, 1.0)]

    for future in futures:
        idx, decay_total_insert = future.result()
        subsetted_df.loc[idx, 'Germany'] = decay_total_insert

end_time = time.time()
execution_time = end_time - start_time

The ThreadPoolExecutor spreads the per-combination tasks over a pool of worker threads. Adjust the number of workers (max_workers) to the number of cores available on your system.

Keep in mind that parallelization doesn't always bring a speedup: the overhead of scheduling many small tasks and the Global Interpreter Lock (GIL) in CPython can eat up any gain, especially for CPU-bound NumPy/pandas work like this.
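
If threads don't help because of the GIL, a process pool is the usual workaround. Below is a rough sketch of the same idea with ProcessPoolExecutor, chunked by gr so each task carries enough work to amortize the inter-process overhead. It assumes the data-setup code runs at module level (so worker processes can see sales_df, decay and process_combination) and that the pool is started under an if __name__ == "__main__": guard:

from concurrent.futures import ProcessPoolExecutor

def process_gr(gr):
    # Handle every (l_s, ip, year) combination for one gr value in a single task,
    # so the per-task overhead is spread over many computations.
    results = []
    for l_s in np.arange(-10.0, 10.0, 1.0):
        for ip in [70, 80, 90]:
            for year in np.arange(1942.0, 2100.0, 1.0):
                results.append(process_combination(gr, l_s, ip, year))
    return results

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as executor:
        for chunk in executor.map(process_gr, np.arange(-2.0, 2.0, 0.5)):
            for idx, decay_total_insert in chunk:
                subsetted_df.loc[idx, 'Germany'] = decay_total_insert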
