2

The raw data is in Excel files (see screenshot) with 3 columns. The script calculates a result from a simple formula over those columns, and when the result reaches a limit, it prints the row.

import pandas as pd

df = pd.read_excel(r"C:\excel_file.xlsx", sheet_name="Sheet1")

P1 = df['Period 1']
P2 = df['Period 2']
P3 = df['Period 3']

df['Predict'] = 12.5 + (0.35 * P1 + 0.5 * P2 + 0.8 * P3)

for index, row in df.iterrows():
    if row['Predict'] >= 100:
        print(row['SKU and Product code'])

The problem: there are many rows in each file and more than 100k files. A single run currently takes about 3 full days to complete.

The calculation and the logic are simple, but the data volume and the number of files are huge, and this is a frequent task.

Since there is no option to change the raw data (I cannot reduce the number of rows or files), I am wondering whether GPU programming is a way to shorten the processing time.

I googled, flipped through a book, and got the feeling that GPU programming is more for advanced tasks like machine learning etc.

What would the rewritten code for the above look like, if GPU programming can be used for this case? Thank you.

[screenshot of the raw Excel data with the three period columns]

Mark K
    "many rows" means how many rows? – Sraw Jul 12 '19 at 07:04
  • @Sraw, thank you for the comment. The number of rows varies between 600 and 3,500 per file. – Mark K Jul 12 '19 at 07:05
  • Hmm, I don't think there are really that "many" rows in total, and I don't think a GPU will really help you in this case. You should first monitor your CPU usage and, if it is low, try using `multiprocessing` first. – Sraw Jul 12 '19 at 07:10
  • @Sraw, I have now split the files into groups and have multiple scripts conducting similar tasks. CPU usage is already at 100%. – Mark K Jul 12 '19 at 07:11
  • 1
    Well, with the limited information I cannot give more advice. But at least GPU is not your saver. There might be many reasons such as if you have a slow disk, or you should consider a faster reader for excel files. See some relatives: https://stackoverflow.com/questions/50695778/how-to-increase-process-speed-using-read-excel-in-pandas – Sraw Jul 12 '19 at 07:29
  • @Sraw, thanks again. (I did try converting them to CSV.) – Mark K Jul 12 '19 at 07:31

3 Answers

2

Is your calculation really "return all the rows that meet this criteria"? (I am not a pandas pro, so maybe I am missing something.) If I am reading it right, 6,000 multiplications and comparisons should take drastically less than a second. In a quick experiment, the data had to be around 700k rows before the calculation took a whole second on my cheap laptop.

However, there are only 86,400 seconds in a day, so even if each file takes only 1 second to load, that is already too much here! From similar quick experimentation, a CSV only had to be 25k rows long for reading the data in from a file to take longer than a second. This is a helpful hint: reading the files dominates the calculations in this case, so that is what we should aim at.

When we have a job like this, there are a couple of general approaches. Parallelization is the first. If we can do 10 files at a time, this drops us to around 2 hours in the case where each takes 1 second. This is... probably not possible. We'll saturate the disk at some point... but in a case like this, the disk is exactly what we want to be the bottleneck. For that, you would use a multiprocessing pool, where you pass the NAME of the file into the work queue (you don't want to read the data and then pass the data around; reading the data is your bottleneck). A sketch follows.
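For example, a minimal sketch of that idea, reusing the formula from the question; the folder path and the worker count are assumptions you would tune against your own disk:

import glob
from multiprocessing import Pool

import pandas as pd

def process_file(path):
    # Read one workbook, compute the prediction, return the SKUs over the limit.
    df = pd.read_excel(path, sheet_name="Sheet1")
    df["Predict"] = 12.5 + (0.35 * df["Period 1"]
                            + 0.5 * df["Period 2"]
                            + 0.8 * df["Period 3"])
    return df.loc[df["Predict"] >= 100, "SKU and Product code"].tolist()

if __name__ == "__main__":
    files = glob.glob(r"C:\data\*.xlsx")    # assumed location of the input files
    with Pool(processes=10) as pool:        # only file names travel to the workers
        for skus in pool.imap_unordered(process_file, files):
            for sku in skus:
                print(sku)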

Assuming there are limits to that gain, the next step is distribution. Where are those files stored? If they are on S3 or an NFS share or something like that, you could easily use something like Fabric to run a script on multiple machines. 10 machines, each doing 10 files at a time, cuts us from hours to minutes, which is probably enough.

However, there is an even better trick: presumably, these 100k files aren't all being created new every day. Load them into a system designed for distributed, parallelized computation such as BigQuery, Redshift, or Hadoop/Spark with HDFS. Then the "read" part of the work only has to be done for each day's new files, and you can just run the calculations.
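As a rough illustration (not a full pipeline), here is what the same filter might look like in PySpark once the workbooks have been landed as CSV on HDFS; the landing path and schema inference are assumptions:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("predict-filter").getOrCreate()

# Read every landed CSV at once; Spark distributes the read across the cluster.
df = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv("hdfs:///data/periods/*.csv"))

df = df.withColumn(
    "Predict",
    F.lit(12.5) + 0.35 * F.col("Period 1")
                + 0.5 * F.col("Period 2")
                + 0.8 * F.col("Period 3"))

df.filter(F.col("Predict") >= 100).select("SKU and Product code").show()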

Paul Becotte
1

Python is an interpreted language, so a compiled language such as C would probably be faster. But that means starting from scratch, and the C code gets a lot more complex because you will have to work with pointers (unless you can throw away the data after every row, so that the data held in memory stays limited). Furthermore, I must admit I have never used C to read Excel sheets. It may be that your code in fact spends most of its time reading and extracting the Excel files, while the calculations are of minor importance. If read time dominates, even C will soon reach its limit. Try whether it makes a difference if the data is on a slow hard disk or on a fast (SSD) disk.

  • I am using quite a decent computer (CPU, hard disk, RAM and graphics card all above general standards) for this task. – Mark K Jul 12 '19 at 07:28
  • 1
    This answer is basically "oh well, its python's fault" ... that is clearly not the case, as python (including the many highly optimized third party libraries) is used in the vast majority of professional data processing tasks. – Paul Becotte Aug 24 '19 at 16:23
  • To supplement PaulBecotte's comment: the performance-critical internal parts of the numpy and pandas libraries are in fact implemented as pre-compiled C code. – Xukrao Aug 24 '19 at 21:57
1

You can also still optimize your existing Python code. I'm certain that you'll see a significant computational performance improvement when you replace the iterrows() loop with purely vectorized operations:

import pandas as pd

df = pd.read_excel(r"C:\excel_file.xlsx", sheet_name="Sheet1")

df['Predict'] = 12.5 + (0.35 * df['Period 1'] + 0.5 * df['Period 2'] + 0.8 * df['Period 3'])

row_mask = df['Predict'] >= 100
target_columns = ['SKU and Product code', 'Period 1', 'Period 2', 'Period 3']
df2 = df.loc[row_mask, target_columns]

print(df2)

If your performance bottleneck lies with I/O operations to/from disk, then this won't make a big dent in overall execution times. However, since it's a rather quick and straightforward code change, I'd say it's worth trying in any case.

P.S.: I'd recommend getting into the habit of always writing vectorized pandas operations instead of pandas operations with for-loops. That is basically how pandas is intended to be used and what gives you the best computational speed. See here and here for some good guides on optimizing pandas code.

Xukrao