Let's say I have a dataframe full of data, with a column containing different URLs, and I want to scrape a price from the page behind each URL (the dataframe is pretty big, more than 15k rows). I also want this scraping to run continuously: when it reaches the end of the URLs, it starts over again and again. The last column of the dataframe (prices) would be updated every time a price is scraped.

Here is a visual example of a toy dataframe:

Col 1 ... Col N  URL                             Price
XXXX  ... XXXXX  http://www.some-website1.com/   23,5$
XXXX  ... XXXXX  http://www.some-website2.com/   233,5$
XXXX  ... XXXXX  http://www.some-website3.com/   5$
XXXX  ... XXXXX  http://www.some-website4.com/   2$
.
.
.

My question is: what is the most efficient way to scrape those URLs with a parallel method (multi-threading, ...), knowing that I can implement the solution with requests/selenium/bs4 ... (I can learn pretty much anything)? I would like a theoretical answer more than some lines of code, but if you have a block to send, don't hesitate :)

Thank you

Jinter
  • I think you're targeting the wrong part of your problem here. You don't really need multi-threading for web scraping - it's an IO-bound task. Instead you should take a look at Python's async/await (asyncio) functionality, which will allow you to scrape many targets concurrently without the need for threading. – Granitosaurus Oct 05 '21 at 05:46
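
For what it's worth, a minimal sketch of the asyncio route Granitosaurus suggests could look like the following. It assumes the aiohttp library and a hypothetical ".price" CSS selector (neither is part of the original question); the column names follow the toy dataframe above.

import asyncio
import aiohttp  # third-party library; assumed to be installed
from bs4 import BeautifulSoup


async def fetch_price(session, url, sem):
    # The semaphore caps in-flight requests so 15k URLs don't all fire at once.
    async with sem:
        try:
            async with session.get(url) as resp:
                html = await resp.text()
                # ".price" is a placeholder selector -- adapt it to the real pages
                tag = BeautifulSoup(html, "html.parser").select_one(".price")
                return tag.text if tag else None
        except aiohttp.ClientError:
            return None


async def scrape_forever(df, concurrency=20):
    sem = asyncio.Semaphore(concurrency)
    async with aiohttp.ClientSession() as session:
        while True:  # start over once the end of the URLs is reached
            tasks = [fetch_price(session, url, sem) for url in df["URL"]]
            df["Price"] = await asyncio.gather(*tasks)

# asyncio.run(scrape_forever(df))

A single thread drives all the requests here; the concurrency parameter is the only knob you need to tune against how many simultaneous connections the target sites tolerate.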

1 Answer


You can use the next example to check the URLs periodically. It wraps df.iterrows() in itertools.cycle, and this generator is then fed to Pool.imap_unordered to get the data:

import requests
import pandas as pd
from time import sleep
from itertools import cycle
from bs4 import BeautifulSoup
from multiprocessing import Pool

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:92.0) Gecko/20100101 Firefox/92.0"
}


def get_data(tpl):
    # tpl is one (index, row) pair produced by df.iterrows()
    idx, row = tpl

    r = requests.get(row["url"], headers=headers)
    soup = BeautifulSoup(r.content, "html.parser")

    # throttle each worker a little between requests
    sleep(1)

    # the CSS class below matches the price element on Yahoo Finance quote
    # pages (at the time of writing); change it for other sites
    return (
        idx,
        soup.find(class_="Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)").text,
    )


if __name__ == "__main__":
    # the dataframe used in this example (see "df used" below)
    df = pd.DataFrame(
        {
            "url": [
                "https://finance.yahoo.com/quote/AAPL/",
                "https://finance.yahoo.com/quote/INTC/",
            ]
        }
    )

    # cycle() restarts the iteration once the last row has been yielded
    c = cycle(df.iterrows())

    with Pool(processes=2) as p:
        # imap_unordered yields each result as soon as a worker finishes it
        for i, (idx, new_price) in enumerate(p.imap_unordered(get_data, c)):
            df.loc[idx, "Price"] = new_price

            # print the dataframe only every 10th iteration:
            if i % 10 == 0:
                print()
                print(df)
            else:
                print(".", end="")

Prints:

...

                                     url   Price
0  https://finance.yahoo.com/quote/AAPL/  139.14
1  https://finance.yahoo.com/quote/INTC/   53.47
.........

...and so on

df used:

                                     url
0  https://finance.yahoo.com/quote/AAPL/
1  https://finance.yahoo.com/quote/INTC/
Andrej Kesely
  • Thank you for your answer. I had this idea too, but I have seen in this post (https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas) that all the "iter"-ish pandas functions are not optimal at all. I always try to vectorize my operations with ".apply" or with a list comprehension. I already tried to implement multi-threading with pandas apply but didn't manage to make it work. – Jinter Oct 04 '21 at 23:53
  • @Jinter Well, you're making HTTP requests, so that cannot be "vectorized" (in the *numerical* sense). Simply iterate over each row, make a request and store the new data. `multiprocessing.Pool` will parallelize it. – Andrej Kesely Oct 04 '21 at 23:58
  • 1
  • I've thought about it during the night and yes, you're right. Because of the duration of each request, the gain in time if I managed to use pandas apply to fill in the df would be very low. I will use your solution, thank you, and have a good day – Jinter Oct 05 '21 at 07:41
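
Picking up the comment thread: if you prefer threads over processes (the work is IO-bound, so either one just parallelizes the waiting), a rough sketch with concurrent.futures could look like this. It replaces .apply with executor.map over the URL column; the ".price" selector is a placeholder and the column names follow the answer's toy dataframe, so adjust both to your real data.

import requests
import pandas as pd
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

headers = {"User-Agent": "Mozilla/5.0"}


def fetch_price(url):
    # Threads work well here because each call spends most of its time
    # waiting on the network, not on the CPU.
    try:
        r = requests.get(url, headers=headers, timeout=10)
        # ".price" is a placeholder selector -- adapt it to the real pages
        tag = BeautifulSoup(r.content, "html.parser").select_one(".price")
        return tag.text if tag else None
    except requests.RequestException:
        return None


def refresh_prices(df, workers=8):
    with ThreadPoolExecutor(max_workers=workers) as ex:
        # executor.map keeps results in the same order as df["url"],
        # so they can be assigned straight back to the column
        df["Price"] = list(ex.map(fetch_price, df["url"]))

# while True:            # rerun forever, as in the question
#     refresh_prices(df)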