I have a Python script that loads data from 10k to 12k files and performs some operations. Sometimes this process takes hours. In such cases, I would like to see how much progress the script has made.

Say I'm loading 10,000 files in a for loop. I don't want to do something like:

if n % 100 == 0:
    print("%d steps completed!" % n)

as this will unnecessarily evaluate the if condition thousands of times. I know that the total cost of this if statement will still be small compared to the hours the script takes to run; however, I was curious whether Python has an efficient feature for keeping track of progress.
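To put a rough number on the cost of that check, a small `timeit` comparison like the one below can help. It is only a sketch: the function names and `N_ITERATIONS` are made up for illustration, the loop body is a stand-in for the real per-file work, and the `print` call is left out so that only the `n % 100 == 0` test is measured.

```python
import timeit

N_ITERATIONS = 10_000  # assumption: roughly the number of files in the question
RUNS = 10              # how many times each loop is executed per timing


def loop_without_check(n_iterations):
    """Plain loop: a stand-in for the per-file work, with no progress check."""
    count = 0
    for n in range(n_iterations):
        count += 1
    return count


def loop_with_check(n_iterations):
    """Same loop plus the `n % 100 == 0` test (the print itself is omitted)."""
    count = 0
    hits = 0
    for n in range(n_iterations):
        count += 1
        if n % 100 == 0:
            hits += 1
    return count, hits


# Take the best of several repeats to reduce noise from other processes.
baseline = min(timeit.repeat(lambda: loop_without_check(N_ITERATIONS), number=RUNS, repeat=5))
checked = min(timeit.repeat(lambda: loop_with_check(N_ITERATIONS), number=RUNS, repeat=5))

extra_ns = (checked - baseline) / (RUNS * N_ITERATIONS) * 1e9
print(f"extra cost per iteration: about {extra_ns:.1f} ns")
```

The exact figure depends on the machine and the Python version, and for very small differences the measurement can be dominated by noise, so treat it as an order-of-magnitude estimate only.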

  • Any such feature would involve an evaluation of the progress, and so would be no better than your simple `print()` every 100 files. You have correctly recognized that the `if` statement will cost next to nothing compared to all the IO you're doing, so you're trying to solve a non-issue. – Pranav Hosangadi Apr 11 '22 at 15:49
  • If you have concerns about efficiency or performance, the best way is to measure it yourself. You can load 100 files and compare the performance with and without the `if`. Of course, I believe a simple condition will have no discernible impact on performance. – chomprrr Apr 11 '22 at 15:55
  • The condition takes about 40 ns on my machine, and any basic operation like `n+1` already takes 30 ns. This is close to the minimum time the CPython interpreter needs for one instruction. If you want faster code, then you definitely should not use Python (and especially not CPython). Loading a file should take far more than 1000 ns whatever the target system (it requires 3 expensive syscalls). – Jérôme Richard Apr 11 '22 at 15:57
  • I think another useful thing to consider is a way to keep track of which files have been processed and which haven't. This way, if your script ends up failing, you always know what else needs to be processed. For this, consider the [`sqlite3`](https://docs.python.org/3/library/sqlite3.html) module – smac89 Apr 11 '22 at 15:58
  • Unless you can estimate approximately how long your script should run, I don't see any way to achieve your goal. Even with log levels and so on, it will not be good. For me, the `if` solution is the best, whereas flagging the files would have the worst impact, since it would be executed for each file, for instance. – Floh Apr 11 '22 at 15:59
  • @smac89 I'm pretty sure that any call to sqlite3 would take more time than the print (or logging). And it would be executed at every step, just like the `if` statement. – Floh Apr 11 '22 at 16:01
  • @Floh The suggestion to use sqlite3 has nothing to do with reducing the runtime cost of the program. I'm making an additional suggestion that will help OP avoid having to run the program multiple times if the initial runs fail. Storing progress in a db will probably add at most 10 seconds to the entire runtime (which is on the order of hours). The savings you can achieve there cannot be expressed in Big-Oh notation. – smac89 Apr 11 '22 at 16:18
  • I agree @smac89 that being able to survive a failure without starting again from scratch will have a better overall impact than saving the `if` statement. – Floh Apr 12 '22 at 07:12
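A minimal sketch of the `sqlite3` idea from the comments, for anyone who wants to try it. The table name `processed`, the helper names, and the `process_file` callback are all hypothetical; this is one possible layout under those assumptions, not the commenter's actual code.

```python
import sqlite3


def open_progress_db(path):
    """Open (or create) a small database that records finished files."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed (filename TEXT PRIMARY KEY)"
    )
    conn.commit()
    return conn


def already_processed(conn, filename):
    """Return True if this file was recorded as done in an earlier run."""
    row = conn.execute(
        "SELECT 1 FROM processed WHERE filename = ?", (filename,)
    ).fetchone()
    return row is not None


def mark_processed(conn, filename):
    """Record a file as done; committing right away survives a later crash."""
    conn.execute(
        "INSERT OR IGNORE INTO processed (filename) VALUES (?)", (filename,)
    )
    conn.commit()


def process_all(conn, filenames, process_file):
    """Process only files not yet recorded; return how many were handled now."""
    handled = 0
    for filename in filenames:
        if already_processed(conn, filename):
            continue  # finished in a previous run, skip it
        process_file(filename)
        mark_processed(conn, filename)
        handled += 1
    return handled
```

With something like `conn = open_progress_db("progress.db")` (the file name is just an example), a second run of `process_all` after a crash would skip everything that was already recorded. Committing after every file costs some time per file; committing in batches would be cheaper but could lose the last few entries on a failure.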

1 Answer

Try using the tqdm library, like this:

from tqdm import tqdm

for i in tqdm(range(<your_cycle_range>)):
    <your operations>

instead of

for n in range(<your_cycle_range>):
    ....
    ....
    if n % 100 == 0:
        print("%d steps completed!" % n)

PS: this assumes that you already have a loop (cycle) in your Python script.
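For a loop over file paths, a slightly more concrete sketch might look like the following. The names `load_all` and `load_file` are hypothetical placeholders for your own code, and the `ImportError` fallback is only there so the snippet still runs (without a bar) when tqdm is not installed.

```python
try:
    from tqdm import tqdm
except ImportError:
    # Fallback so the sketch still runs without tqdm: no progress bar is shown.
    def tqdm(iterable, **kwargs):
        return iterable


def load_all(file_paths, load_file):
    """Call load_file on every path while tqdm reports progress."""
    results = []
    # tqdm takes the total from len(file_paths); mininterval=1.0 limits
    # display refreshes to roughly once per second.
    for path in tqdm(file_paths, desc="Loading", unit="file", mininterval=1.0):
        results.append(load_file(path))
    return results
```

For example, `load_all(paths, open_and_parse)` with your own list of paths and loading function. Note that tqdm still does a small amount of bookkeeping on every iteration, so this is about nicer feedback (percentage, elapsed time, estimated time remaining), not about doing less work than the `if`.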

lemon
  • How is this more efficient than a simple `if`? – Pranav Hosangadi Apr 11 '22 at 15:51
  • @PranavHosangadi Nothing can be more efficient than the `if`. But if the author is asking about other ways to implement it, tqdm just makes it prettier – Dmitry Barsukoff Apr 11 '22 at 15:53
  • @PranavHosangadi - The simple if of the example doesn't give a hint of how far there is to go. tqdm does that in several different formats. tqdm is a very common way to give feedback and it makes a lot of sense to make it an answer here. – tdelaney Apr 11 '22 at 15:55
  • This example assumes that you already have a cycle. Read the PS. There's no need for any further statements. @PranavHosangadi – lemon Apr 11 '22 at 15:56
  • @tdelaney sure this answers "how do I display progress?", but in the context of OP's question where they are worried an _`if` statement_ is too expensive, I don't think something that displays a whole progress bar is a good solution. The answer should mention this, otherwise it gives the impression that `tqdm` will add less overhead than an `if`. – Pranav Hosangadi Apr 11 '22 at 16:00
  • Updated the answer, `if` here is just an unneeded expression which adds up. – lemon Apr 11 '22 at 16:01