
I have a df that contains a column of URLs. When one of these URLs is visited, you are redirected to the real URL. I need to loop through all the URLs one by one and get each one's redirect URL back, like this:

import requests

def get_redirect_url(url):
    # requests follows redirects by default; r.url is the final URL
    r = requests.get(url.strip())
    return r.url

df.url.apply(get_redirect_url)

If the df has 100 rows, the apply call takes about 3 minutes to finish. But sometimes I may have a df with 5000+ rows, which takes an hour. I wonder if there is any way to speed up the operation.

Is it possible to have multiple threads running at the same time to speed up?
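It is: since the work is I/O-bound (waiting on HTTP responses), threads help despite the GIL. Below is a minimal sketch using `concurrent.futures.ThreadPoolExecutor`; the lookup function is stubbed so it runs without network access, and the worker count and example URLs are illustrative, not part of the question:

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def get_redirect_url(url):
    # Stand-in for the real requests-based lookup, so the sketch runs offline.
    return url.replace("short.example", "real.example")

df = pd.DataFrame({"url": ["http://short.example/a", "http://short.example/b"]})

# executor.map preserves input order, so the results line up with df.index.
with ThreadPoolExecutor(max_workers=20) as executor:
    df["redirect_url"] = list(executor.map(get_redirect_url, df["url"]))

print(df["redirect_url"].tolist())
```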


Update

The solution shared by @GiantsLoveDeathMetal works!

Cheng
    Yes, see https://stackoverflow.com/questions/26784164/pandas-multiprocessing-apply – foxyblue Oct 10 '17 at 09:59
  • Are you restricted to the use of `get_redirect_url()`? Or could you, say, fetch all (url, redirect_url) pairs and map to your df using vectorized operations? What's slowing you down is definitely the use of apply, but I'm assuming you're somehow bound to use it, since it's in your question title. – Plasma Oct 10 '17 at 10:16
  • @Plasma how do you do that? – Cheng Oct 10 '17 at 10:47
  • Depends on how you get your redirect_urls. If you somehow manage to get a DF with all url-redirect pairs, you could do something like `combined = df.merge(df2, on="url")`, where `df2` is a dataframe with two columns `["url", "redirect_url"]`. But it depends entirely on how you get your redirects, and whether that process can be sped up, e.g. parallelized or cached. Did this make sense? – Plasma Oct 10 '17 at 11:09
  • I see what you mean. I have to rely on the `requests` lib to grab the redirect url like this `r = requests.get(url.strip()) ` – Cheng Oct 10 '17 at 11:15
  • Possible duplicate of [pandas multiprocessing apply](https://stackoverflow.com/questions/26784164/pandas-multiprocessing-apply) – foxyblue Oct 20 '17 at 12:24

0 Answers