An accurate progress bar for loading files and transforming data using Vaex and Pandas

Question

I am looking for a way to show a progress bar, with the remaining time, while loading a large file with Vaex or transforming a large dataset with pandas. I have checked this thread https://stackoverflow.com/questions/3160699/python-progress-bar, but none of the progress bars there are accurate enough for my needs: the command has already finished before the progress bar is complete. I am looking for something similar to %time, which prints the time spent by a line or a command. In my case I want to see the estimated remaining time and a progress bar for any command, without using a for loop.

Here is my code:

from progress.bar import Bar

with Bar('Processing', max=1) as bar:
    %time sample_tolls_amount=df_panda_tolls.sample(n = 4999);
    bar.next()
        
Processing |################################| 1/1CPU times: total: 11.1 s
Wall time: 11.1 s

The for loop is unnecessary because I need to run this command only once. In fact, with a for loop (max=20), the progress bar was still running after the data (sample_tolls_amount) was ready. Is there a practical way to check the progress of any command, just like %time does for timing?

I have tried several functions, but all of them fail to show the real progress of the command. I don't have for loops; I have commands that load or transform big data files. I therefore want to know the progress made and the remaining time every time I run one of these commands, just like downloading a file in the browser: you see how many GB have been downloaded and how much is left. I am looking for something easy to apply, as easy as %time (say, %progress).

Many Vaex methods already have progress bars included. Otherwise, you can look at this example: https://vaex.readthedocs.io/en/latest/guides/progressbars.html – Joco Oct 29 '22 at 19:57
@Joco, I have checked this method, but I thought it was only possible for Vaex-related commands. However, I ran different commands using numpy and researchpy, and this feature works for all of them too. Now another question arises: can I make a shortcut for ` with vaex.progress.tree('rich', title="My Vaex computations")` so that I can write this command faster? – José Miguel Rego Terol Oct 30 '22 at 09:10
Oh, I didn't even know it could work for non-Vaex code. That's great. As for a shortcut, I'm not sure. I guess you can do something like `from vaex.progress import tree as my_progress` and then `with my_progress(...)`. Instead of `my_progress` you can of course use any name you want. – Joco Oct 30 '22 at 11:20
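
For a single blocking call like the `sample` line in the question, there is usually no hook that reports how much of the work is done, so a true percentage is hard to get. A minimal sketch of a fallback, using only the standard library: a context manager that prints the elapsed time from a background thread while the call runs. The name `elapsed` and its parameters are made up for this illustration, and the `time.sleep` call is only a stand-in for the real command.

```python
import sys
import threading
import time
from contextlib import contextmanager


@contextmanager
def elapsed(label="Working", interval=0.2, out=sys.stderr):
    """Print elapsed seconds in the background while the with-block runs."""
    stop = threading.Event()
    start = time.perf_counter()

    def ticker():
        # Event.wait returns False on timeout and True once stop is set.
        while not stop.wait(interval):
            seconds = time.perf_counter() - start
            print(f"\r{label}... {seconds:.1f} s", end="", file=out, flush=True)

    thread = threading.Thread(target=ticker, daemon=True)
    thread.start()
    try:
        yield
    finally:
        stop.set()
        thread.join()
        seconds = time.perf_counter() - start
        print(f"\r{label} done in {seconds:.1f} s", file=out)


# Hypothetical usage: replace the sleep with the real one-off command.
with elapsed("Sampling"):
    time.sleep(0.5)
```

This shows that the command is still running and for how long, not how much is left. If the wrapped call holds the interpreter lock for long stretches, the counter may update less often than `interval` suggests.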

score 0 · Answer 1 · answered Oct 28 '22 at 23:33

I use these two progress bar variants, which need nothing beyond the standard library and are easy to embed in your own code.

simple progress bar:

import time


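n = 25
for i in range(n):
    time.sleep(0.1)  # stand-in for one unit of real work
    progress = int((i + 1) / n * 50)  # bar length, 50 characters when done
    print(f'running {i+1} of {n} {progress*"."}', end='\r', flush=True)
print()  # move to a new line once the loop has finished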

more elaborate progress bar:

import time

def print_progressbar(total, current, barsize=60):
    progress = int(current*barsize/total)
    completed = str(int(current*100/total)) + '%'
    print('[', chr(9608)*progress, ' ', completed, '.'*(barsize-progress), '] ', str(current)+'/'+str(total), sep='', end='\r', flush=True)



total = 600
barsize = 60
print_frequency = max(min(total//barsize, 100), 1)
print("Start Task..")
for i in range(1, total+1):
    time.sleep(0.0001)
    if i%print_frequency == 0 or i == 1:
        print_progressbar(total, i, barsize)
print("\nFinished")

answered Oct 28 '22 at 23:33

D.L


This is exactly where my problem lies. Take the first (simple) progress bar: the code has to go inside the for loop, right? Now take the command whose progress I want to check, ` %time sample_tolls_amount=df_panda_tolls.sample(n = 4999); `. If n=25, the for loop runs 25 times, so the command ` df_data.sample (n = 4999)` is executed 25 times because it sits inside the loop. I do not want the same variable computed 25 times; I want to see the progress of the command itself. Please do not hesitate to correct me if my explanation is wrong. – José Miguel Rego Terol Oct 30 '22 at 08:35
So you want to see what percentage of `df_data.sample (n = 4999)` has completed? So at n=2500, completed would be ~50%. Is that what you mean? – D.L Oct 30 '22 at 09:17
Exactly! I want to see a progress bar that represents the progress of the command. As you said: at n=2500, completed would be ~50%. – José Miguel Rego Terol Oct 30 '22 at 10:56
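
A percentage like the one discussed in these comments is only meaningful when the work can be split into pieces of known size. A minimal sketch for the file-loading case, standard library only: read the file in chunks and draw the bar from the bytes actually read against the file size. The function name `read_with_progress` and its parameters are assumptions for this illustration and not part of Vaex or pandas.

```python
import os
import sys


def read_with_progress(path, chunk_size=1 << 20, barsize=40, out=sys.stderr):
    """Read a file in chunks and draw a bar from the bytes actually read."""
    total = os.path.getsize(path)
    done = 0
    parts = []
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:  # end of file (also covers an empty file)
                break
            parts.append(chunk)
            done += len(chunk)
            filled = done * barsize // total
            percent = done * 100 // total
            bar = "#" * filled + "." * (barsize - filled)
            print(f"\r[{bar}] {percent}% ({done}/{total} bytes)",
                  end="", file=out, flush=True)
    print(file=out)  # finish the carriage-return line
    return b"".join(parts)
```

Here each loop iteration does a different part of the job, so nothing is computed twice and the bar reaches 100% exactly when the last byte has been read. The same idea should carry over to any operation that can be fed its input piece by piece.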
