4

I am trying to iterate over a pandas DataFrame with close to a million rows, using a for loop. Consider the following code as an example:

import pandas as pd 
import os 
from requests_html import HTMLSession
from tqdm import tqdm
import time


df = pd.read_csv(os.getcwd()+'/test-urls.csv')
df = df.drop('Unnamed: 0', axis=1 )

new_df = pd.DataFrame(columns = ['pid', 'orig_url', 'hosted_url'])
refused_df = pd.DataFrame(columns = ['pid', 'refused_url'])

tic = time.time()

for idx, row in df.iterrows():

    img_id = row['pid']
    url = row['image_url']

    # Let's do the scraping
    session = HTMLSession()
    r  = session.get(url)
    r.html.render(sleep=1, keep_page=True, scrolldown=1)

    count = 0 
    link_vals =  r.html.find('.zoomable')

    if len(link_vals) != 0 : 
        attrs = link_vals[0].attrs
        # print(attrs['src'])  
        embed_link = attrs['src']

    else: 
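        # Retry the lookup up to 7 times before giving up on this URL
        while count < 7:
            link_vals = r.html.find('.zoomable')
            if link_vals:
                embed_link = link_vals[0].attrs['src']
                break
            count += 1
        else:
            # Reached only when no retry found a link (the loop never hit `break`)
            print('Link refused connection for 7 tries. Adding URL to Refused URLs Data Frame')
            ref_val = [img_id, url]
            len_ref = len(refused_df)
            refused_df.loc[len_ref] = ref_val
            print('Refused URL added')
            continue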
    print('Got 1 link')

    # Append scraped data to new_df
    len_df = len(new_df)
    append_value = [img_id,url, embed_link]
    new_df.loc[len_df] = append_value

How can I add a progress bar to get a visual representation of how far along the loop is? I would appreciate any help. Please let me know if you need any clarification.

sanster9292
  • What are you trying to do? If your index is ordered, you could just print a percentage: index / shape. I agree with robbwh: if you're using iterrows, you're probably doing it wrong. – Umar.H Jun 19 '20 at 20:13
  • I am trying to scrape some data from URLs. I added the code above. Please let me know if you think I can make any alterations – sanster9292 Jun 19 '20 at 20:29

4 Answers

4

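You could try tqdm:

from tqdm import tqdm

# iterrows() is a generator with no length, so pass total= to get a percentage and ETA
for idx, row in tqdm(df.iterrows(), total=len(df)):
    ...  # do something with row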

This is primarily for a command-line progress bar. There are other solutions if you're looking for more of a GUI. PySimpleGUI comes to mind, but is definitely a little more complicated.

1

I would have left this as a comment, but: the reason you want a progress bar may be that the loop takes a long time, and iterrows() is a slow way to do operations in pandas.

I would suggest using apply and avoiding iterrows().

If you want to keep using iterrows(), just include a counter that counts up to the number of rows, df.shape[0].
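A minimal sketch of the counter idea, using only the standard library. The helper names `format_progress` and `iterate_with_progress` are made up for this example, and the list of dicts is stand-in data; with a real DataFrame you would pass `df.iterrows()` and `total=df.shape[0]` instead.

```python
import sys


def format_progress(done, total):
    """Return a one-line summary such as '250/1000 rows (25.0%)'."""
    percent = 100.0 * done / total if total else 100.0
    return f"{done}/{total} rows ({percent:.1f}%)"


def iterate_with_progress(rows, total, every=100, out=sys.stderr):
    """Yield each row unchanged, writing a progress line every `every` rows."""
    for done, row in enumerate(rows, start=1):
        yield row
        # Execution resumes here once the caller has finished with the row
        if done % every == 0 or done == total:
            # '\r' moves back to the start of the line so the text is overwritten
            out.write("\r" + format_progress(done, total))
            out.flush()
    out.write("\n")


# Stand-in data (hypothetical); with pandas this would be
# iterate_with_progress(df.iterrows(), total=df.shape[0])
rows = [{"pid": i, "image_url": f"https://example.com/{i}"} for i in range(1000)]

processed = 0
for row in iterate_with_progress(rows, total=len(rows), every=250):
    processed += 1  # the scraping work would go here
```

Because the counter lives in the generator, a `continue` in the loop body still advances it, so skipped rows are counted too.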

robbwh
  • I am trying to scrape some data from URLs. I added the code above. Please let me know if you think I can make any alterations – sanster9292 Jun 19 '20 at 20:24
  • It's not quite as simple as 'apply is faster than iterrows', but it's in the right direction. https://stackoverflow.com/questions/24870953/does-pandas-iterrows-have-performance-issues – tomaszps Jun 19 '20 at 20:26
  • @sanster9292 ah I thought you were mostly doing within dataframe operations. If performance is egregiously slow I would consider finding a way to parallelize these operations, although I have limited experience with this, and this could be a silly suggestion. Best of luck. – robbwh Jun 19 '20 at 20:43
  • @robbwh the speed is fine; I just wanted a visual representation of how far along I was, since I will be crawling close to 700k URLs – sanster9292 Jun 20 '20 at 14:45
1

PySimpleGUI makes this about as simple a problem to solve as possible, assuming you know ahead of time how many items you have in your list. Indeterminate progress meters are possible, but a little more complicated.

There is no setup required before your loop. You don't need to make a special iterator. The only thing you need to do is add one line of code inside your loop.

Inside your loop, add a call to one_line_progress_meter. The name sums up what it is. Add this call at the top of your loop or at the bottom, it doesn't matter; just add it somewhere that's looped.

The 4 parameters you pass are:

  • A title to put on the meter (any string will do)
  • Where you are now - current counter
  • What the max counter value is
  • A "key" - a unique string, number, anything you want.

Here's a loop that iterates through a list of integers to demonstrate.

import PySimpleGUI as sg

items = list(range(1000))
total_items = len(items)
for index, item in enumerate(items):

    sg.one_line_progress_meter('My meter', index+1, total_items, 'my meter' )

The list iteration code will be whatever your loop code is. The line of code to focus on that you'll be adding is this one:

sg.one_line_progress_meter('My meter', index+1, total_items, 'my meter' )

This line of code will show you the window below. It contains statistical information such as how long the loop has been running and an estimate of how much longer it has to go.

[Image: the progress meter window]
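For a rough idea of where an "elapsed" and "time remaining" figure can come from, here is a minimal sketch in plain Python. It assumes each remaining item takes the average time seen so far; I am not claiming this is how PySimpleGUI computes its numbers internally. `estimate_remaining` and `format_duration` are hypothetical helper names.

```python
import time


def estimate_remaining(elapsed_seconds, done, total):
    """Estimate seconds left, assuming remaining items take the average time so far."""
    if done <= 0:
        return None  # nothing finished yet, so there is no rate to extrapolate from
    average_per_item = elapsed_seconds / done
    return average_per_item * (total - done)


def format_duration(seconds):
    """Format a number of seconds as H:MM:SS."""
    seconds = int(round(seconds))
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


items = list(range(1000))
total_items = len(items)
start = time.monotonic()
for index, item in enumerate(items):
    # ... your loop body goes here ...
    elapsed = time.monotonic() - start
    remaining = estimate_remaining(elapsed, index + 1, total_items)
```

For example, if 100 of 1000 items took 10 seconds, the average is 0.1 seconds per item and the estimate for the remaining 900 is about 90 seconds.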

Mike from PSG
0

How to do that in pandas apply? I do this:

import PySimpleGUI as sg

def some_func(a, b):
    global index
    c = a + b  # placeholder for some function involving a and b
    index += 1
    sg.one_line_progress_meter('My meter', index, len(df), 'my meter')
    return c

index = 0
df['c'] = df[['a', 'b']].apply(lambda x: some_func(*x), axis=1)
  • If you have a new question, please ask it by clicking the [Ask Question](https://stackoverflow.com/questions/ask) button. Include a link to this question if it helps provide context. - [From Review](/review/late-answers/31235258) – jazzpi Mar 10 '22 at 22:22