I have a Python program that crunches a large dataset using Pandas. It currently takes about 15 minutes to complete. I want to log the progress of the task (print it to stdout and send it as a metric to Datadog). Is there a way to get the percentage complete of the task (or of a function)? In the future, I might be dealing with larger datasets. The task is a simple grouping of a large pandas data frame, something like this:

dfDict = {}
for cat in categoryList:
    df1 = df[df['category'] == cat]
    if len(df1.index) > 0:
        df1[dateCol] = pd.to_datetime(df[dateCol])
        dfDict[cat] = df1

Here, categoryList has about 20,000 items, and df is a large data frame with (say) 5 million rows.

I am not looking for anything fancy (like progress bars), just a percentage-complete value. Any ideas?

Thanks!

user1717931
  • Possible duplicate of [Python Progress Bar](http://stackoverflow.com/questions/3160699/python-progress-bar) – dodell Sep 15 '16 at 14:10

2 Answers

You can modify the following according to your needs.

from time import sleep

for i in range(12):
    sleep(1)
    print("\r\t> Progress\t:{:.2%}".format((i + 1)/12), end='')

This stops print() from writing its default end character, a newline (end=''), and writes a carriage return ('\r') before anything else, so the cursor moves back to the start of the line. In simple terms, each print() overwrites the output of the previous one.
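The carriage-return trick only helps in an interactive terminal. When stdout is redirected to a log file, each update is better written on its own line. Here is a minimal sketch of that idea. The names `format_progress` and `show_progress` are made up for illustration and are not part of any library:

```python
import sys


def format_progress(done, total):
    """Return the completed fraction as text, e.g. '25.00%'."""
    return "{:.2%}".format(done / total)


def show_progress(done, total, stream=sys.stdout):
    """Write one progress update to the stream (hypothetical helper)."""
    text = "Progress: " + format_progress(done, total)
    if stream.isatty():
        # Interactive terminal: go back to the line start and overwrite.
        stream.write("\r" + text)
    else:
        # File or log collector: emit one line per update.
        stream.write(text + "\n")
    stream.flush()
```

Calling `show_progress(i + 1, 12)` inside the loop above would replace the print() call.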

Ma0

The naive solution would be to take the total number of rows in your dataset and the index you are at, then calculate the progress:

size = len(dataset)
for index, element in enumerate(dataset):
    print(index / size * 100)

This will only be somewhat reliable if every row takes around the same time to complete. Because you have a large dataset, it might average out over time, but if some rows take a millisecond, and another takes 10 minutes, the percentage will be garbage.
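If the work is done group by group rather than row by row, one way to soften this problem is to weight progress by the number of rows already handled instead of the number of groups. This is only a sketch under assumptions: `process_with_row_progress`, `work` and `report` are hypothetical names, and it assumes the cost of a group grows roughly with its row count:

```python
def process_with_row_progress(groups, total_rows, work, report=print):
    """Run work(key, chunk) per group and report percent of rows done.

    groups     -- iterable of (key, chunk) pairs; len(chunk) must work
    total_rows -- total number of rows across all chunks
    report     -- called with the percentage after each group
    """
    rows_done = 0
    for key, chunk in groups:
        work(key, chunk)
        rows_done += len(chunk)
        # A large group advances the percentage more than a small one.
        report(round(100 * rows_done / total_rows, 1))
```

With pandas, iterating over `df.groupby('category')` yields such (key, sub-frame) pairs, and `len(df)` gives the total row count.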

Also consider rounding the percentage to one decimal:

size = len(dataset)
for index, element in enumerate(dataset):
    print(round(index / size * 100, 1))

Printing for every row might slow your task down significantly, so consider printing only when the rounded percentage changes:

size       = len(dataset)
percentage = 0
for index, element in enumerate(dataset):
    new_percentage = round(index / size * 100, 1)
    if percentage != new_percentage:
        percentage = new_percentage
        print(percentage)
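The same idea can be wrapped in a small generator so that the loop from the question barely changes. This is a sketch, not a definitive implementation: `track` and `emit` are hypothetical names, and `emit` stands in for whatever does the reporting (print, a logger, or a metrics client):

```python
def track(iterable, total, emit=print):
    """Yield items, calling emit(percent) when the whole percent changes."""
    last = 0
    for done, item in enumerate(iterable, start=1):
        yield item
        # Runs once the consumer has finished with the item and asks
        # for the next one.
        percent = done * 100 // total
        if percent != last:
            last = percent
            emit(percent)
```

In the question's code this would read `for cat in track(categoryList, len(categoryList)):`. With 20,000 categories, `emit` is called at most 100 times.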

There are, of course, also modules for this:

progressbar

progress

mzhaase
  • Exactly! Some subsetting will be faster and some will take longer. I have seen progress bars, and my gut feeling says they will behave similarly. However, I will take a hard look at progress bar now. – user1717931 Sep 15 '16 at 14:14