
I have a script that takes links from a file, visits them, gets the redirected links, and stores them back. But it is far too slow on a file with 15k records. How can I make it faster? I have already tried threading, but I cannot get it fast enough. Is there any solution to this problem?

import concurrent.futures
import sys
import pandas as pd
import requests
from threading import Thread
from queue import Queue
out_put_file=""
linkes = None
out = []
urls = []
old = []
file_name =None
concurrent = 10000
q = None
count=0
df =None
def do_work():
    while True:
        global q
        url = q.get()
        res = get_status(url)
        q.task_done()
def get_status(o_url):
    global count
    try:
        res = requests.get(o_url)
        if res:
            out.append(res.url)
            old.append(o_url)
        print(count)
        count = count + 1
        return [res.status_code, res.url, o_url]
    except requests.RequestException:
        return None
def process_data():
        global q
        global file_name
        global linkes
        global df
        file_name = input("Enter file name : ")
        file_name = file_name.strip()
        print("Generating .......")
        df = pd.read_csv(file_name+".csv")
        old_links =df["shopify"]
        for i in old_links:
            if type(i)!=str:
                urls.append(i)
                continue
            if not i.startswith("http"):
                
                linkes = "http://"+i 
                urls.append(linkes)
            else:
                urls.append(i)
        df["shopify"]=urls
        q = Queue(concurrent * 2)
        for i in range(concurrent):
            t = Thread(target=do_work)
            t.daemon = True
            t.start()
        try:
            for url in urls:
                if type(url)!=str:
                    continue
                q.put(url.strip())
            q.join()
        except KeyboardInterrupt:
            sys.exit(1)
process_data()
for i in range (len(df['shopify'])):
    for j in range(len(old)):
        if df['shopify'][i]==old[j]:
            df['shopify'][i]=out[j]
df = df[~df['shopify'].astype(str).str.startswith('http:')]
df = df.dropna()
df.to_csv(file_name+"-new.csv",index=False)
Sample rows from the CSV file:

Email,shopify,Proofy_Status_Name
hello@knobblystudio.com,http://puravidabracelets.myshopify.com,Deliverable
service@cafe-select.co.uk,cafe-select.co.uk,Deliverable
mtafich@gmail.com,,Deliverable
whoopies@stevessnacks.com,stevessnacks.com,Deliverable
customerservice@runwayriches.com,runwayriches.com,Deliverable
shop@blackdogride.com.au,blackdogride.com.au,Deliverable
anavasconcelos.nica@gmail.com,grass4you.com,Deliverable
info@prideandprestigehair.com,prideandprestigehair.com,Deliverable
info@dancinwoofs.com,dancinwoofs.com,Deliverable
Nabi Bux
  • Please [edit] your question to add the first few lines of the CSV file. Also please don't name a variable the same as a module (`concurrent`) as that can cause problems. By the way, your load_url function is unused. – Nathan Mills Nov 15 '22 at 00:12
  • I've added the CSV file data (10 rows), and yes, load_url isn't used; ignore it. – Nabi Bux Nov 15 '22 at 12:07
  • It still gives the same error. – Nabi Bux Nov 15 '22 at 21:02

1 Answer


Python threads cannot run Python bytecode in parallel because of the Global Interpreter Lock (the GIL is released during blocking I/O, but CPU-bound work is still serialized). You might want to use the multiprocessing module instead, or ProcessPoolExecutor from concurrent.futures. If you use ProcessPoolExecutor, pass the URLs to the worker function and have it return both the old URL and the redirected URL; you get those results back from the result() method of the Future that executor.submit() returns. Keep in mind that, unlike threads, processes do not share global variables.
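
For reference, the basic submit/result pattern with ProcessPoolExecutor looks roughly like this (a minimal sketch; fetch_redirect and the example URLs are illustrative, not part of the original script):

from concurrent.futures import ProcessPoolExecutor, as_completed
import requests

def fetch_redirect(url):
    # Runs in a separate worker process; return the data instead of mutating
    # globals, because processes do not share memory with the parent.
    try:
        res = requests.get(url, timeout=10)
        return url, res.url
    except requests.RequestException:
        return url, url  # keep the original URL if the request fails

if __name__ == "__main__":
    urls = ["http://example.com", "http://example.org"]
    with ProcessPoolExecutor(max_workers=8) as executor:
        futures = [executor.submit(fetch_redirect, u) for u in urls]
        for f in as_completed(futures):
            old_url, new_url = f.result()
            print(old_url, "->", new_url)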

There have been attempts to remove the global interpreter lock, but single-threaded Python tended to run noticeably slower without it, which is why those attempts were never adopted.

Something like the following might work. I renamed the concurrent variable because it would shadow the concurrent module and probably cause an error. This code is untested because I don't have the csv file to test with.

import concurrent.futures
from concurrent.futures import ProcessPoolExecutor
import sys
import pandas as pd
import requests
import numpy as np
from threading import Thread
from queue import Queue
out_put_file=""
linkes = None
out = []
urls = []
old = []
futures = []
file_name =None
concurrent_ = 10000
q = None
count=0
df =None
def do_work(urls):
    results = []
    for url in urls:
        res = get_status(url)
        if res:
            results.append((res[2], res[1]))
        else:
            results.append((url, url))
    return results
def get_status(o_url):
    try:
        res = requests.get(o_url)
        if res:
            out.append(res.url)
            old.append(o_url)
        #print(count)
        #count=count+1
        return [res.status_code,res.url ,o_url]
    except:
        pass
def load_url(url, timeout):
    ans = requests.get(url, timeout=timeout)
    return [ans.status_code,ans.url,url]
def process_data():
        global q
        global file_name
        global linkes
        global df
        global urls
        file_name = input("Enter file name : ")
        file_name = file_name.strip()
        print("Generating .......")
        df = pd.read_csv(file_name+".csv")
        old_links =df["shopify"]
        for i in old_links:
            if type(i)!=str:
                urls.append(i)
                continue
            if not i.startswith("http"):
                
                linkes = "http://"+i 
                urls.append(linkes)
            else:
                urls.append(i)
        df["shopify"]=urls
        workers = 50
        with ProcessPoolExecutor(max_workers=workers) as executor:
            url_arrays = np.array_split(urls, workers)
            for urls in url_arrays:
                f = executor.submit(do_work, urls)
                futures.append(f)
process_data()
# Iterate the futures in submission order (not as_completed) so the results
# line up with the original row order of the dataframe.
df['shopify'] = [res[1] for f in futures for res in f.result()]
df = df[~df['shopify'].astype(str).str.startswith('http:')]
df = df.dropna()
df.to_csv(file_name+"-new.csv",index=False)
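
One likely explanation for the repeated file-name prompt reported in the comments below (an assumption on my part, since I could not reproduce it here): on Windows and macOS, ProcessPoolExecutor starts its workers with the "spawn" method, which re-imports the script in every worker process, so module-level code such as the process_data() call runs again in each worker. Guarding the top-level code avoids that; the last lines of the script above would become:

# Guard the entry point: with the "spawn" start method (the default on
# Windows and macOS), every worker re-imports this file, and without the
# guard the module-level process_data() call and its input() prompt would
# run again in each worker process.
if __name__ == "__main__":
    process_data()
    df['shopify'] = [res[1] for f in futures for res in f.result()]
    df = df[~df['shopify'].astype(str).str.startswith('http:')]
    df = df.dropna()
    df.to_csv(file_name + "-new.csv", index=False)

The worker function do_work (and anything it calls) has to stay at module level so the child processes can import it.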
Nathan Mills
  • How can I use multiprocessing in this case? Can you adjust the code? – Nabi Bux Nov 14 '22 at 13:25
  • Help me fix this; I have spent a lot of time trying different things but cannot make it work. – Nabi Bux Nov 14 '22 at 20:59
  • The above code keeps prompting for the filename and never gets inside the do_work() method. I'm not sure why; it keeps asking me to enter the filename for some unknown reason. – Nabi Bux Nov 15 '22 at 11:52
  • It asks to enter the filename about max_workers times; with max_workers set to 50, it asks 50 times. When I enter the filename, it gives an error along the lines of "an attempt to run another process while others are already running". – Nabi Bux Nov 15 '22 at 11:55
  • I've added 10 rows of my data in case you wish to test it on your side. It is a CSV file. – Nabi Bux Nov 15 '22 at 12:06
  • Are you sure you copied the code correctly? It only prompts once every time I run the script on my computer. Did you accidentally write `executor.submit(process_data)`? That won't work and would give the behavior you describe of asking for the filename several times. – Nathan Mills Nov 15 '22 at 20:20
  • I ran exactly the code you sent; it's not working on my side, asking me to enter the file name again and again, and I didn't change anything in the code. – Nabi Bux Nov 15 '22 at 22:13