
I am running some Python code in Jupyter on a DataFrame of about 84k rows, and I estimate it is going to take somewhere in the neighborhood of 9 hours at this rate. My code is below. I have read that ideally one would vectorize for maximum speed, but being fairly new to Python and coding in general, I'm not sure how to go about changing the code below to vectorize it. The goal is to take the value in the first column of the DataFrame, append it to the end of a URL, fetch the first line of the response, and compare it to some predetermined values to find out if there is a match. Any advice would be greatly appreciated!

#Python 3

import pandas as pd
import urllib.request

no_res = "Item Not Found"
error = "Does Not Exist"

for i in df1.index:

   path = 'http://xxxx/xxx/xxx.pl?part=' + str(df1.loc[i, 'ITEM_ID'])
   parsed_path = path.replace(' ', '%20')
   f = urllib.request.urlopen(parsed_path)
   raw = f.read().decode("utf-8")
   lines = raw.split('\n')
   r = lines[0]

   if r == no_res:
      sap = 'NO'
   elif r == error:
      sap = 'ERROR'
   else:
      sap = 'YES'

   # these assignments belong inside the loop; .loc avoids chained
   # indexing, which can silently fail to write back
   df1.loc[i, "item exists"] = sap
   df1.loc[i, "Path"] = path
   df1.loc[i, "URL return value"] = r
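For what it's worth, the YES/NO/ERROR branch by itself can be vectorized once the response lines are sitting in a column; only the network call cannot. A sketch with np.select, using the column names from my code above and a toy frame standing in for the fetched first lines:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the fetched first lines of each response.
df = pd.DataFrame({"URL return value": ["Item Not Found", "OK", "Does Not Exist"]})

conditions = [
    df["URL return value"] == "Item Not Found",
    df["URL return value"] == "Does Not Exist",
]
# np.select picks the first matching choice per row; the default covers "YES".
df["item exists"] = np.select(conditions, ["NO", "ERROR"], default="YES")
```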

Edit: adding test code below.

import concurrent.futures
import pandas as pd
import urllib.request
import numpy as np

def my_func(df_row):
    no_res = "random"
    error = "entered"
    path = "http://www.google.com"
    parsed_path = path.replace(' ','%20')
    f = urllib.request.urlopen(parsed_path)
    raw = str(f.read().decode("utf-8"))
    lines = raw.split('\n')
    r = df_row['0']

    if r == no_res:
        sap = "NO"
    elif r == error:
        sap = "ERROR"
    else:
        sap = "YES"

    df_row['4'] = sap
    df_row['5'] = lines[0]
    df_row['6'] = r
   
n = 1000
my_df = pd.DataFrame(np.random.choice(['random','words','entered'], size=(n,3)))
my_df['4'] = ""
my_df['5'] = ""
my_df['6'] = ""
my_df = my_df.apply(lambda col: col.astype('category'))

executor = concurrent.futures.ProcessPoolExecutor(8)
futures = [executor.submit(my_func, row) for _,row in my_df.iterrows()]
concurrent.futures.wait(futures)

This is throwing the following error (shortened):

DoneAndNotDoneFutures(done={<Future at 0x1cfe4938040 state=finished raised BrokenProcessPool>, <Future at 0x1cfe48b8040 state=finished raised BrokenProcessPool>,

    You won't be able to vectorize code that relies on `request`, you can probably parallelize it as I/O (not CPU) is likely the bottleneck. – mozway Jul 20 '22 at 19:58
  • I appreciate the quick response! Thank you for the info :) – vudu Jul 20 '22 at 20:00

1 Answer


Since each row involves an external request to a URL, I do not think vectorization is a solution here (if it is possible at all).

The bottleneck of your operation is the following line:

f = urllib.request.urlopen(parsed_path)

This line blocks while waiting for the response; as mentioned in the comments, your operation is I/O bound. The CPU could be starting other requests while it waits, and the way to do that is concurrency.
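To illustrate the idea, here is a minimal thread-pool sketch. The check_item helper and the fake lookup table are made up so the logic can be exercised without a live server; in real use, fetch would wrap urllib.request.urlopen and return the first line of the response:

```python
import concurrent.futures

def classify(first_line, no_res="Item Not Found", error="Does Not Exist"):
    # Map the first response line to the three status labels from the question.
    if first_line == no_res:
        return "NO"
    if first_line == error:
        return "ERROR"
    return "YES"

def check_item(item_id, fetch):
    # fetch(item_id) stands in for the urlopen call; injecting it keeps
    # the sketch testable offline.
    return item_id, classify(fetch(item_id))

fake = {"A1": "Item Not Found", "B2": "OK", "C3": "Does Not Exist"}

# Threads overlap the waits instead of serializing them; for real network
# calls this is where the speedup comes from.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    results = dict(pool.map(lambda i: check_item(i, fake.get), fake))
```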

Edit: my original answer used Python's built-in multithreading, which was problematic. A better way to do multiprocessing/threading with a pandas DataFrame is the "dask" library.

The following code was tested with the dummy data set on my PC and, on average, speeds up the naive for loop by roughly 12x.

#%%
import time
import urllib.request

import pandas as pd
import numpy as np

import dask.dataframe as dd

def my_func(df_row):
    df_row = df_row.copy()
    no_res = "random"
    error = "entered"
    path = "http://www.google.com"
    parsed_path = path.replace(' ','%20')
    f = urllib.request.urlopen(parsed_path)
    # I had to change the encoding on my machine.
    raw = f.read().decode("windows-1252")
    lines = raw.split('\n')
    r = df_row[0]

    if r == no_res:
        sap = "NO"
    elif r == error:
        sap = "ERROR"
    else:
        sap = "YES"

    df_row['4'] = sap
    df_row['5'] = lines[0]
    df_row['6'] = r
    return df_row

def run():
    print("started.")
    n = 1000
    my_df = pd.DataFrame(np.random.choice(['random','words','entered'], size=(n,3)))
    my_df = my_df.apply(lambda col: col.astype('category'))
    my_df['4'] = ""
    my_df['5'] = ""
    my_df['6'] = ""

    # Literally dask partitions the original dataframe into
    # npartitions chunks and use them in apply function
    # in parallel.
    my_ddf = dd.from_pandas(my_df, npartitions=15)
    start = time.time()
    q = my_ddf.apply(my_func, axis= 1, meta=my_ddf)
    # num_workers is the number of threads used.
    print(q.compute(num_workers=50))
    time_end = time.time()
    print(f"Elapsed: {time_end - start:10.2f}")

if __name__ == "__main__":
    run()

dask provides many other tools and options for concurrent processing, and it would be a good idea to look through its documentation to investigate the alternatives.

P.S.: if you run the above code too many times against Google, you will receive "HTTP Error 429: Too Many Requests". Servers rate-limit like this to prevent abuse such as DDoS attacks. So if your real job queries a public website, you may end up receiving the same 429 response if you fire 84k queries in a short time.
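If you do hit 429s, a common mitigation (not part of the code above) is to retry with exponential backoff. A sketch, with the delay schedule pulled out into its own helper so it can be checked without a live server:

```python
import time
import urllib.request
import urllib.error

def backoff_delays(retries, base=1.0, cap=60.0):
    # Exponential backoff schedule: base, 2*base, 4*base, ... capped at `cap`.
    return [min(base * 2 ** i, cap) for i in range(retries)]

def fetch_with_retry(url, retries=5):
    # Illustrative wrapper: retry only on HTTP 429, sleeping according to
    # the backoff schedule between attempts; other errors propagate.
    for delay in backoff_delays(retries):
        try:
            with urllib.request.urlopen(url) as f:
                return f.read()
        except urllib.error.HTTPError as e:
            if e.code != 429:
                raise
            time.sleep(delay)
    raise RuntimeError("still rate-limited after all retries")
```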

  • I couldn't seem to get this to work on my work PC so I moved over to my personal and threw in some dummy tasks to test it out but I'm getting the same error here as well; clearly goofing something up along the way. I'll edit the original post with my test code. – vudu Jul 21 '22 at 00:24
  • I used the dummy data frame you provided and the code did not work properly. pandas DataFrames are not thread safe, so using the built-in multithreading library was not a good idea. I edited the answer to use the dask library, which is meant for multiprocessing with pandas and numpy. – Farshid Jul 21 '22 at 15:59
  • For the real code, try running it without dask (just replace the dask dataframe with the pandas one, something like my_df.apply(the function, axis=1)) and see if it throws the error. If it does, you are probably using a wrong column name. – Farshid Jul 22 '22 at 01:38
  • Can't thank you enough for taking the time to help with this. Problem solved! Thanks for the tip on dask. Will definitely take your advice and get more familiar with it. – vudu Jul 22 '22 at 01:48
  • Real code ended up having a simple goof-up on my end...had to put it aside for other action items, but it was a simple oversight: I left off the "return df_row". The final runtime was 58 minutes, down from 9 HOURS, so this was a huge help...thanks again! – vudu Jul 22 '22 at 01:49