I am actively running some Python code in jupyter on a df consisting of about 84k rows. I'm estimating this is going to take somewhere in the neighborhood of 9 hours at this rate. My code is below, I have read that ideally one would vectorize for max speed but being sort of new to Python and coding in general, I'm not sure how I can go about changing the below code to vectorize it. The goal is to look at the value in the first column of the dataframe and add that value to the end of a url. I then check the first line in the url and compare it to some predetermined values to find out if there is a match. Any advice would be greatly appreciated!
#Python 3
import pandas as pd
import urllib
no_res = "Item Not Found"
error = "Does Not Exist"
for i in df1.index:
path = 'http://xxxx/xxx/xxx.pl?part=' + str(df1['ITEM_ID'][i])
parsed_path = path.replace(' ','%20')
f = urllib.request.urlopen(parsed_path)
raw = str(f.read().decode("utf-8"))
lines = raw.split('\n')
r = lines[0]
if r == no_res:
sap = 'NO'
elif r == error:
sap = 'ERROR'
else:
sap = 'YES'
df1["item exists"][i] = sap
df1["Path"][i] = path
df1["URL return value"][i] = r
Edit adding test code below
import concurrent.futures
import pandas as pd
import urllib
import numpy as np
def my_func(df_row):
no_res = "random"
error = "entered"
path = "http://www.google.com"
parsed_path = path.replace(' ','%20')
f = urllib.request.urlopen(parsed_path)
raw = str(f.read().decode("utf-8"))
lines = raw.split('\n')
r = df_row['0']
if r == no_res:
sap = "NO"
elif r == error:
sap = "ERROR"
else:
sap = "YES"
df_row['4'] = sap
df_row['5'] = lines[0]
df_row['6'] = r
n = 1000
my_df = pd.DataFrame(np.random.choice(['random','words','entered'], size=(n,3)))
my_df['4'] = ""
my_df['5'] = ""
my_df['6'] = ""
my_df = my_df.apply(lambda col: col.astype('category'))
executor = concurrent.futures.ProcessPoolExecutor(8)
futures = [executor.submit(my_func, row) for _,row in my_df.iterrows()]
concurrent.futures.wait(futures)
This is throwing the following error (shortened):
DoneAndNotDoneFutures(done={<Future at 0x1cfe4938040 state=finished raised BrokenProcessPool>, <Future at 0x1cfe48b8040 state=finished raised BrokenProcessPool>,