
I have the following soup:

next ...

From this I want to extract the href, "some_url", and the whole list of pages that are listed on this page: https://www.catholic-hierarchy.org/diocese/laa.html

Note: there are a whole lot of links to sub-pages which I need to parse. At the moment I am getting all the data out of them: dioceses, URLs, descriptions, contact data, etc.

The example below will grab all URLs of dioceses, get some info about each of them and create a final dataframe. To speed up the process, multiprocessing.Pool is used:

But how do I get this scraper running without multiprocessing? I want to run it in Colab, so I need to get rid of the multiprocessing part.

How can I achieve this?

import requests
import pandas as pd
from bs4 import BeautifulSoup
from multiprocessing import Pool


def get_dioceses_urls(section_url):
    dioceses_urls = set()

    while True:
        print(section_url)

        soup = BeautifulSoup(
            requests.get(section_url, headers=headers).content, "lxml"
        )
        for a in soup.select('ul a[href^="d"]'):
            dioceses_urls.add(
                "https://www.catholic-hierarchy.org/diocese/" + a["href"]
            )

        # is there Next Page button?
        next_page = soup.select_one('a:has(img[alt="[Next Page]"])')
        if next_page:
            section_url = (
                "https://www.catholic-hierarchy.org/diocese/"
                + next_page["href"]
            )
        else:
            break

    return dioceses_urls


def get_diocese_info(url):
    print(url)

    soup = BeautifulSoup(requests.get(url, headers=headers).content, "html5lib")

    data = {
        "Title 1": soup.h1.get_text(strip=True),
        "Title 2": soup.h2.get_text(strip=True),
        "Title 3": soup.h3.get_text(strip=True) if soup.h3 else "-",
        "URL": url,
    }

    li = soup.find(
        lambda tag: tag.name == "li"
        and "type of jurisdiction:" in tag.text.lower()
        and tag.find() is None
    )
    if li:
        for l in li.find_previous("ul").find_all("li"):
            t = l.get_text(strip=True, separator=" ")
            if ":" in t:
                k, v = t.split(":", maxsplit=1)
                data[k.strip()] = v.strip()

    # get other info about the diocese
    # ...

    return data


if __name__ == "__main__":
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0"
    }

    # get main sections:
    url = "https://www.catholic-hierarchy.org/diocese/laa.html"
    soup = BeautifulSoup(
        requests.get(url, headers=headers).content, "html.parser"
    )

    main_sections = [url]
    for a in soup.select("a[target='_parent']"):
        main_sections.append(
            "https://www.catholic-hierarchy.org/diocese/" + a["href"]
        )

    all_data, dioceses_urls = [], set()
    with Pool() as pool:
        # get all dioceses urls:
        for urls in pool.imap_unordered(get_dioceses_urls, main_sections):
            dioceses_urls.update(urls)

        # get info about all dioceses:
        for info in pool.imap_unordered(get_diocese_info, dioceses_urls):
            all_data.append(info)

    # create dataframe from the info about dioceses
    df = pd.DataFrame(all_data).sort_values("Title 1")

    # save it to csv file
    df.to_csv("data.csv", index=False)
    print(df.head().to_markdown())
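
To be clear about what I am after: I imagine replacing the Pool block with plain sequential loops, roughly like this (a rough, untested sketch using the same functions and variables as above):

    # rough idea (untested): run both stages sequentially instead of via Pool
    all_data, dioceses_urls = [], set()

    # get all dioceses urls, one section at a time:
    for section_url in main_sections:
        dioceses_urls.update(get_dioceses_urls(section_url))

    # get info about all dioceses, one page at a time:
    for diocese_url in dioceses_urls:
        all_data.append(get_diocese_info(diocese_url))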

Update: this is what I get back when I run the script on Colab:

https://www.catholic-hierarchy.org/diocese/laa.htmlhttps://www.catholic-hierarchy.org/diocese/lab.html

---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "<ipython-input-1-f5ea34a0190f>", line 21, in get_dioceses_urls
    next_page = soup.select_one('a:has(img[alt="[Next Page]"])')
  File "/usr/local/lib/python3.7/dist-packages/bs4/element.py", line 1403, in select_one
    value = self.select(selector, limit=1)
  File "/usr/local/lib/python3.7/dist-packages/bs4/element.py", line 1528, in select
    'Only the following pseudo-classes are implemented: nth-of-type.')
NotImplementedError: Only the following pseudo-classes are implemented: nth-of-type.
"""

The above exception was the direct cause of the following exception:

NotImplementedError                       Traceback (most recent call last)
<ipython-input-1-f5ea34a0190f> in <module>
     81     with Pool() as pool:
     82         # get all dioceses urls:
---> 83         for urls in pool.imap_unordered(get_dioceses_urls, main_sections):
     84             dioceses_urls.update(urls)
     85 

/usr/lib/python3.7/multiprocessing/pool.py in next(self, timeout)
    746         if success:
    747             return value
--> 748         raise value
    749 
    750     __next__ = next                    # XXX

NotImplementedError: Only the following pseudo-classes are implemented: nth-of-type.
thannen
    any reason why you cannot replace `Pool` with `ThreadPool` ? – Ahmed AEK Sep 18 '22 at 19:24
  • 2
    also google colab does support multiprocessing. – Ahmed AEK Sep 18 '22 at 19:30
  • 1
    Hi Ahmed AEK, many thanks for the quick reply. I will retry running this on Colab in order to get as much of the data out of the resource as possible. I have had issues while running the script on Colab. I will write down all the findings as an additional update in the thread opening. – thannen Sep 18 '22 at 20:00
  • 1
    I have updated the thread start and written down what I get when I run the script on Colab. Please advise. – thannen Sep 18 '22 at 20:04
  • 2
    Can you rewrite the logic, for multi-threading? Something along these lines: https://www.geeksforgeeks.org/multithreading-python-set-1/ Are you familiar with the concept of a `queue` in python? – Barry the Platipus Sep 18 '22 at 20:39
  • 1
    The error in question is not related to multiprocessing or multithreading; it's caused by Beautiful Soup not implementing something. – Ahmed AEK Sep 18 '22 at 21:06
  • 2
    He could always do a `pip install -U bs4` and then a `pip install -U soupsieve`. However, multiprocessing is not the best way forward here. Best would be to use an event loop and async it, ideally with a solution like httpx. @AhmedAEK – Barry the Platipus Sep 18 '22 at 21:17
  • 2
    @BarrythePlatipus is there a reason asyncio is preferred over multiprocessing in web scraping? The benefit of using multiple cores is usually hard to outweigh. – Ahmed AEK Sep 18 '22 at 22:58
  • 3
    See this: https://stackoverflow.com/questions/27435284/multiprocessing-vs-multithreading-vs-asyncio-in-python-3 – Barry the Platipus Sep 18 '22 at 23:24

2 Answers


The following is one way of getting that information, in an async fashion (it should work in Colab notebooks). I got the diocese URLs from a different part of the site (Structured View - World Regions). I would expect the diocese count there to match the count from the letters list.

from httpx import Client, AsyncClient, Limits
from bs4 import BeautifulSoup as bs
import pandas as pd
import re
from datetime import datetime
import asyncio
import nest_asyncio

nest_asyncio.apply()

headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

big_df_list = []

def all_dioceses():
    dioceses = []
    root_links = [f'https://www.catholic-hierarchy.org/diocese/qview{x}.html' for x in range(1, 8)]
    with Client(headers=headers, timeout=60.0, follow_redirects=True) as client:
        for x in root_links:
            r = client.get(x)
            soup = bs(r.text)
            soup.select_one('ul#menu2').decompose()
            for link in soup.select('ul > li > a'):
                dioceses.append('https://www.catholic-hierarchy.org/diocese/' + link.get('href'))
    return dioceses
# print(all_dioceses())

async def get_diocese_info(url):
    async with AsyncClient(headers=headers, timeout=60.0, follow_redirects=True) as client:
        try:
            r = await client.get(url)
            soup = bs(r.text)
            d_name = soup.select_one('h1[align="center"]').get_text(strip=True)
            info_table = soup.select_one('div[id="d1"] > table')
            d_bishops = ' | '.join([x.get_text(strip=True) for x in info_table.select('td')[0].select('li')])
            d_extra_info = ' | '.join([x.get_text(strip=True) for x in info_table.select('td')[1].select('li')])
            big_df_list.append((d_name, d_bishops, d_extra_info, url))
            print('done', d_name)
        except Exception as e:
            print(url, e)

async def scrape_dioceses():
    start_time = datetime.now()
    tasks = asyncio.Queue()
    for x in all_dioceses():
        tasks.put_nowait(get_diocese_info(x))

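    # 100 worker coroutines pull from the queue; each awaits one request at a
    # time, so at most 100 requests are in flight concurrently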
    async def worker():
        while not tasks.empty():
            await tasks.get_nowait()
            
    await asyncio.gather(*[worker() for _ in range(100)])
    end_time = datetime.now()
    duration = end_time - start_time
    print('diocese scraping took', duration)

asyncio.run(scrape_dioceses())
df = pd.DataFrame(big_df_list, columns = ['Name', 'Bishops', 'Info', 'Url'])
print(df)

Result in terminal:

done Eparchy of Mississauga (Syro-Malabar)
done Eparchy of Mar Addai of Toronto (Chaldean)
done Eparchy of Saint-Sauveur de Montréal (Melkite Greek)
done Diocese of Calgary
done Archdiocese of Winnipeg
[...]
diocese scraping took 0:03:02.366096

Name    Bishops Info    Url
0   Eparchy of Mississauga (Syro-Malabar)   JoseKalluvelil, Bishop  Type of Jurisdiction: Eparchy | Elevated:22 December2018 | Immediately Subject to the Holy See | Syro-Malabar Catholic Church of the Chaldean Tradition | Country:Canada | Mailing Address: Syro-Malabar Apostolic Exarchate, 6630 Turner Valley Rd., Mississauga, ON L5V 2P1, Canada | Telephone: (905)858-8200 | Fax: 858-8208    https://www.catholic-hierarchy.org/diocese/dmism.html
1   Eparchy of Mar Addai of Toronto (Chaldean)  Robert SaeedJarjis, Bishop | Bawai (Ashur)Soro, Bishop Emeritus Type of Jurisdiction: Eparchy | Erected:10 June2011 | Immediately Subject to the Holy See | Chaldean Catholic Church of the Chaldean Tradition | Country:Canada | Conference Region:Ontario | Mailing Address: 2 High Meadow Place, Toronto, ON M9L 2Z5, Canada | Telephone: (416)746-5816 | Fax: 746-5850  https://www.catholic-hierarchy.org/diocese/dtoch.html
2   Eparchy of Saint-Sauveur de Montréal (Melkite Greek)    MiladJawish, B.S., Bishop   Type of Jurisdiction: Eparchy | Elevated:1 September1984 | Immediately Subject to the Holy See | Melkite Greek Catholic Church of the Byzantine Tradition | Country:Canada | Conference Region:Quebec | Web Site:http://www.melkite.com/ | Mailing Address: 10025 boul. de l'Arcadie, Montreal, QC H4N 2S1, Canada | Telephone: (514)272.6430 | Fax: 202.1274   https://www.catholic-hierarchy.org/diocese/dmome.html
3   Diocese of Calgary  William TerrenceMcGrattan, Bishop | Frederick BernardHenry, Bishop Emeritus Type of Jurisdiction: Diocese | Erected:30 November1912 | Metropolitan: Archdiocese ofEdmonton | Rite: Latin (or Roman) | Province: Alberta | Country:Canada | Square Kilometers: 110,500 (42,680 Square Miles) | Conference Region:West (Ouest) | Catholic Directory Abbreviation: Cal | Official Web Site:http://www.calgarydiocese.ca/ | Mailing Address: Catholic Pastoral Centre, Room 290, The Iona Building, 120-17th Avenue S.W., Calgary, AB T2S 2T2, Canada | Telephone: (403)218-5528 | Fax: 264-0272    https://www.catholic-hierarchy.org/diocese/dcalg.html
4   Archdiocese of Winnipeg Richard JosephGagnon, Archbishop | James VernonWeisgerber, Archbishop Emeritus  Type of Jurisdiction: Archdiocese | Erected:4 December1915 | Immediately Subject to the Holy See | Rite: Latin (or Roman) | Province: Manitoba | Country:Canada | Square Kilometers: 116,405 (44,961 Square Miles) | Conference Region:West (Ouest) | Catholic Directory Abbreviation: W | Official Web Site:http://www.archwinnipeg.ca/ | Mailing Address: Chancery Office, 1495 Pembina Highway, Winnipeg, MB R3T 2C6, Canada | Telephone: (204)452-2227 | Fax: 475-4409  https://www.catholic-hierarchy.org/diocese/dwinn.html
... ... ... ... ...
2619    Archiepiscopal Exarchate of Krym (Ukrainian)    Vacant | Makariy BohdanLeniv, O.S.B.M., Apostolic Administrator | MykhayloBubniy, C.SS.R., Archiepiscopal Administrator Type of Jurisdiction: Archiepiscopal Exarchate | Split:13 February2014 | Metropolitan: Archeparchy ofKyiv-Halyč {Kiev} (Ukrainian) | Ukrainian Catholic Church of the Byzantine Tradition | Country:Ukraine | Mailing Address: vul. Schmidta 22/12, 65000 Odessa, Ukraina | Telephone: (0482)32.58.90 | Fax: 32.58.89   https://www.catholic-hierarchy.org/diocese/dkrym.html
2620    Diocese of Lutsk    VitaliySkomarovskyi, Bishop | MarkijanTrofym’yak, Bishop Emeritus   Type of Jurisdiction: Diocese | Split:28 October1925 | Metropolitan: Archdiocese ofLviv | Rite: Latin (or Roman) | Country:Ukraine | Square Kilometers: 40,190 (15,523 Square Miles) | Official Web Site:http://catholic.volyn.ua/ | Mailing Address: Kuria Diecezjalna, vul. Katedralna 17, 43016 Lutsk, Ukraina | Telephone: (0332)72.15.32 | Fax: (same) https://www.catholic-hierarchy.org/diocese/dluts.html
2621    Diocese of Stockholm    AndersArborelius, O.C.D., Cardinal, Bishop  Type of Jurisdiction: Diocese | Elevated:29 June1953 | Immediately Subject to the Holy See | Rite: Latin (or Roman) | Country:Sweden | Square Kilometers: 450,295 (173,926 Square Miles) | Official Web Site:https://www.katolskakyrkan.se | Mailing Address: Katolska Biskopsambetet, Gotgatan 68, P.O. Box 4114, S-102 62 Stockholm, Sverige | Telephone: (08)462.66.02 | Fax: 702.05.55  https://www.catholic-hierarchy.org/diocese/dstos.html
2622    Archeparchy of Diarbekir (Amida) (Chaldean) RamziGarmou, Ist. del Prado, Archbishop Type of Jurisdiction: Archeparchy | Elevated:3 January1966 | Chaldean Catholic Church of the Chaldean Tradition | Country:Turkey | Mailing Address: Archeveche Chaldeen, Hamalbasi Caddesi 20, Galatasaray, 34435 Beyoglu, Istanbul, Turkiye | Telephone: (0212)252.34.49 | Fax: (same) https://www.catholic-hierarchy.org/diocese/ddiar.html
2623    Eparchy of Kolomyia (Ukrainian) VasylIvasyuk, Bishop    Type of Jurisdiction: Eparchy | Split:12 September2017 | Metropolitan: Archeparchy ofIvano-Frankivsk [Stanislaviv] (Ukrainian) | Ukrainian Catholic Church of the Byzantine Tradition | Country:Ukraine | Square Kilometers: 14,000 (5,407 Square Miles) | Official Web Site:https://kolugcc.org.ua | Mailing Address: vul. Ivana Franka 29, 78200 Kolomyia, Ukraina | Telephone: (06891)19.707 https://www.catholic-hierarchy.org/diocese/dkolo.html
2624 rows × 4 columns

As you can see, this code will pull the full info for 2.6k dioceses in approximately 3 minutes, while using far fewer resources than multiprocessing or multithreading.

You will need to install (or upgrade) the following packages; just run these commands one by one in a Colab notebook (see the note after the list):

pip install -U asyncio
pip install -U nest-asyncio
pip install -U httpx
pip install -U bs4
pip install -U pandas
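
Note that in a Colab notebook cell, shell commands like these are usually run with a leading exclamation mark, for example:

!pip install -U httpx nest-asyncio bs4 pandas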

I also imported re in case you want to select the bits of information one by one (Jurisdiction, Tradition, Address, website, and so on), each in a try/except block to account for missing ones, and extend the list/dataframe accordingly. All packages above can be found on https://pypi.org/ and are documented.
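
As a rough illustration only (not part of the tested code above, and the field labels are just assumed from the sample output), splitting the combined Info string into separate columns could look something like this:

import re
import pandas as pd

def split_info(info: str) -> dict:
    # Each field in the Info string looks roughly like "Label: value",
    # separated by " | "; return them as a dict of individual columns.
    fields = {}
    for part in info.split(" | "):
        m = re.match(r"\s*([^:]+):\s*(.+)", part)
        if m:
            fields[m.group(1).strip()] = m.group(2).strip()
    return fields

# hypothetical usage on the dataframe built above:
# extra_cols = df["Info"].apply(split_info).apply(pd.Series)
# df = pd.concat([df, extra_cols], axis=1)

Labels that are missing on a given page simply end up as NaN in the resulting columns, which is roughly what the try/except idea above is meant to handle.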

Barry the Platipus
  • Hi there, many many thanks for the awesome solution. I am overwhelmed. I have no pro account on Colab, so I will try to run it all on the local machine. Therefore I need to set the local machine up fresh with all the necessary things on it. BTW, I have two options here: a. a Linux notebook and b. a Windows machine with Anaconda. Which one should I go with? Again many thanks, I love your solution, it's so awesome. – thannen Sep 19 '22 at 14:48
  • 2
    You're welcome @thannen. Don't forget to mark my answer as accepted, if it solved your issue (green checkmark under voting buttons). I would recommend the OS which you are most comfortable with, and then create a virtual machine with linux where you can experiment with python. – Barry the Platipus Sep 19 '22 at 14:52
  • Hello dear @Barry the Platipus, many many thanks for all you did. It is just awesome and you helped me a lot. I am happy. BTW, what is the difference in the script if we work only with the 2.6k results? Is the resulting script a bit simpler, and would it be able to run on Colab? I am only an ordinary Colab user and I have a restricted set of tools and plugins over there. So it would be a pleasure for me if I could run it in Colab too! Looking forward to hearing from you. Regards – thannen Oct 18 '22 at 20:50
  • Hello dear Barry the Platipus: I get errors when running your code in Colab. I get back this: ModuleNotFoundError Traceback (most recent call last) in ----> 1 from httpx import Client, AsyncClient, Limits 2 from bs4 import BeautifulSoup as bs 3 import pandas as pd 4 import re 5 from datetime import datetime ModuleNotFoundError: No module named 'httpx' – thannen Oct 20 '22 at 19:38

The problem with running the script on Google Colab is that it currently only supports Python 3.7, which doesn't support the newest version of BeautifulSoup, so your `a:has` selector is not supported. I have replaced it with a loop over all `a` tags, which is slightly slower, but the code works on Google Colab, and there is no need to remove multiprocessing. If you do need to remove multiprocessing, then you should convert your functions into coroutines and run them as tasks using asyncio, as suggested by @Barry the Platipus (see also the sketch after the code below).

def get_dioceses_urls(section_url):
    dioceses_urls = set()

    while True:
        print(section_url)

        soup = BeautifulSoup(
            requests.get(section_url, headers=headers).content, "lxml"
        )
        for a in soup.select('ul a[href^="d"]'):
            dioceses_urls.add(
                "https://www.catholic-hierarchy.org/diocese/" + a["href"]
            )

        # is there a Next Page button?
        next_page = None
        for a in soup.find_all("a"):
            if a.img and a.img.get("alt") == "[Next Page]":
                next_page = a
                break
        if next_page:
            section_url = (
                "https://www.catholic-hierarchy.org/diocese/"
                + next_page["href"]
            )
        else:
            break

    return dioceses_urls
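
If you do end up removing the process pool, a simpler drop-in alternative (as I suggested in the comments) is `multiprocessing.pool.ThreadPool`, which has the same interface as `Pool` but uses threads, which is fine for I/O-bound scraping like this. A minimal sketch, assuming the rest of the script from the question stays the same:

from multiprocessing.pool import ThreadPool

all_data, dioceses_urls = [], set()
with ThreadPool(8) as pool:  # 8 worker threads is just an example value
    # get all dioceses urls:
    for urls in pool.imap_unordered(get_dioceses_urls, main_sections):
        dioceses_urls.update(urls)

    # get info about all dioceses:
    for info in pool.imap_unordered(get_diocese_info, dioceses_urls):
        all_data.append(info)
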
Ahmed AEK