I have this code that I have been working on to create data based on my actual data, using pandas in Python. Here is what my code looks like:

import numpy as np
import pandas as pd

new_df = pd.DataFrame(columns=['dates', 'Column_D', 'Column_A', 'VALUE', 'Column_B', 'Column_C'])
for i in df["dates"].unique():
    for j in df["Column_A"].unique():
        for k in df["Column_B"].unique():
            for m in df["Column_C"].unique():
                n = df[(df["Column_D"] == 'orange') & (df["dates"] == '2005-1-1') & (df["Column_A"] == j) & (df["Column_B"] == k) & (df["Column_C"] == m)]['VALUE']
                x = df[(df["dates"] == '2005-1-1') & (df["Column_A"] == j) & (df["Column_B"] == k) & (df["Column_C"] == m)]['VALUE'].sum()
                tempVal = df[(df["dates"] == i) & (df["Column_A"] == j) & (df["Column_B"] == k) & (df["Column_C"] == m)]['VALUE'].sum()
                finalVal = (n * tempVal) / (x - n)
                if finalVal.empty or finalVal.isna().values.any() or np.isinf(finalVal).values.any():
                    finalVal = 0
                finalVal = int(finalVal)

                new_df = new_df.append({'dates': i, 'Column_D': 'orange', 'Column_A': j, 'VALUE': finalVal, 'Column_B': k, 'Column_C': m}, ignore_index=True)

It takes a long time for my code to run right now and I'm not sure how to speed it up. I suspect the code runs sequentially. Could I get some help to reduce the running time? I want to know how to write my code in parallel and reduce the number of for loops. I heard PySpark is good, but will it help me? Thanks!

user3002936
    The first line has a syntax error: the parenthesis is not closed and input data from each columns are missing. Besides this you need no to use `new_df.append` in the loops and use groupby rather than filtering again and again the same dataframe with different values. These two thing increase dramatically the running-time algorithmic complexity causing a very slow execution. – Jérôme Richard Feb 06 '22 at 04:28
  • Could you show me how I should rewrite it? I'm not sure – user3002936 Feb 06 '22 at 18:14
  • This is not trivial in your case because of the multiple keys, but [here](https://stackoverflow.com/a/70357732/12939557) is a past answer that solved a similar problem (using both lists to prevent slow Pandas `append` and a `groupby`). I think you can apply exactly the same method to your code. By the way, you can precompute things like `df["Column_D"] == 'orange'` outside the loops. – Jérôme Richard Feb 06 '22 at 19:51
  • I'm still kind of confused after looking at that example. How would I use groupby? – user3002936 Feb 07 '22 at 23:12
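Following up on the `groupby` suggestion in the comments, here is one possible vectorized sketch of the same computation. The tiny `df` below is made-up sample data just so the snippet runs; the key columns, the `'orange'` filter, and the `'2005-1-1'` baseline date are taken from the question, but the merge-based layout is an assumption about how the pieces fit together, not the asker's actual data pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data so the sketch is self-contained and runnable.
df = pd.DataFrame({
    'dates':    ['2005-1-1', '2005-1-1', '2005-2-1', '2005-2-1'],
    'Column_D': ['orange', 'apple', 'orange', 'apple'],
    'Column_A': ['a', 'a', 'a', 'a'],
    'Column_B': ['b', 'b', 'b', 'b'],
    'Column_C': ['c', 'c', 'c', 'c'],
    'VALUE':    [10, 30, 7, 5],
})

keys = ['Column_A', 'Column_B', 'Column_C']

# tempVal for every (date, key) combination in one pass, instead of
# re-filtering df inside four nested loops.
temp = df.groupby(['dates'] + keys, as_index=False)['VALUE'].sum()

# n and x only depend on the baseline date, so compute them once per key group.
base = df[df['dates'] == '2005-1-1']
n = (base[base['Column_D'] == 'orange']
     .groupby(keys, as_index=False)['VALUE'].sum()
     .rename(columns={'VALUE': 'n'}))
x = (base.groupby(keys, as_index=False)['VALUE'].sum()
     .rename(columns={'VALUE': 'x'}))

# Align n and x onto every (date, key) row, then compute (n * tempVal) / (x - n).
out = temp.merge(n, on=keys, how='left').merge(x, on=keys, how='left')
out['VALUE'] = (out['n'] * out['VALUE']) / (out['x'] - out['n'])

# Replace inf/NaN (missing groups, division by zero) with 0, like the loop did.
out['VALUE'] = (out['VALUE']
                .replace([np.inf, -np.inf], np.nan)
                .fillna(0)
                .astype(int))

out['Column_D'] = 'orange'
new_df = out[['dates', 'Column_D', 'Column_A', 'VALUE', 'Column_B', 'Column_C']]
```

This builds `new_df` in a handful of whole-column operations, so it avoids both the quadratic cost of `append` in a loop and the repeated boolean filtering; there is usually no need for parallelism or PySpark at this point unless the data no longer fits in memory.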

0 Answers