I am scraping a yellow-pages site to get the names of all physiotherapists in a city. The URL gives me a list of 50 physiotherapists; however, when I expand the page, the URL does not change. How do I get the full list of names?

This is how I get the list of physiotherapists in the city of Rostock:

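import requests
from bs4 import BeautifulSoup

# Assumption: `header` was not shown in the question; a browser-like
# User-Agent is used here as a placeholder.
header = {'User-Agent': 'Mozilla/5.0'}

url = 'https://www.gelbeseiten.de/Suche/Physiotherapie%20praxis/Rostock'
req = requests.get(url, headers=header)
soup = BeautifulSoup(req.content, 'html.parser')

# Each listing title is an <h2> carrying data-wipe-name="Titel".
names = []
business_name = soup.find_all('h2', attrs={"data-wipe-name": "Titel"})
for name in business_name:
    names.append(name.get_text())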

At the bottom of the page there is a button called Mehr Anzeigen, which means "show more". If I click it, the number of physiotherapist entries shown goes from 50 to 60. There are 90 entries in total. Once I have clicked the button enough times to show all of them, the button disappears. The page then lists all the physiotherapists in the city, and that is the list I want.

How do I get all the entries I get after clicking "show more"?

  • Do you know about web automation? Using Selenium WebDriver you can automate the button click and get the page with all the names; then you can scrape the data as you have already done. – Ankush Pandit Jun 03 '21 at 11:11
  • I haven't checked the website, but this usually happens when an XHR request fetches the next items. Check the network requests and find the right one; then you may be able to use it directly. – Tugay Jun 03 '21 at 11:12
  • @AnkushPandit No, I am not aware of it. Can you please point me to the right pages? Thanks! – user16093299 Jun 03 '21 at 11:16
  • There are many resources on the internet where you can learn it; a simple example is here: https://www.geeksforgeeks.org/browser-automation-using-selenium/. You have to install Selenium in your Python environment first, and you also need to install a WebDriver for the browser you are using (Chrome, Firefox, etc.). – Ankush Pandit Jun 03 '21 at 11:21
  • You can make a call to [URL](https://www.gelbeseiten.de/AjaxSuche). In the preview it shows 200, but when called manually it shows an error. By the way, you can find it under `xhr`. – Bhavya Parikh Jun 03 '21 at 11:34

2 Answers

There's no need to use Selenium for this simple task. Chrome's developer tools show that, when you press the 'Mehr anzeigen' button, the website sends a simple POST request to https://www.gelbeseiten.de/AjaxSuche with the following data:

umkreis: -1
WAS: Physiotherapie praxis
WO: rostock
position: 51
anzahl: 10
sortierung: relevanz

The JSON response contains an html key that holds the search results as HTML. The response also has gesamtanzahlTreffer (total number of hits) and anzahlTreffer (number of hits in this response) keys. Unfortunately, it's not possible to get all search results with a single POST request by setting position=0 and anzahl=100. However, the first POST request returns the first 50 results (like the website), and each further POST request returns the next 10.
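To make the paging arithmetic concrete, here is a minimal sketch of which position values the follow-up requests would use. It assumes what was observed above: the first request covers 50 results, each later request covers 10, and position is 1-based (the button sent position 51 after the first 50). The helper name follow_up_positions is made up for illustration and is not part of the site's API.

```python
from typing import Iterator


def follow_up_positions(total: int, first_batch: int = 50,
                        page_size: int = 10) -> Iterator[int]:
    """Yield the `position` value for each request after the first one.

    Assumes 1-based positions, so the request after the first 50 results
    starts at 51, the next one at 61, and so on.
    """
    position = first_batch + 1
    while position <= total:
        yield position
        position += page_size


if __name__ == "__main__":
    # With 90 results in total, four follow-up requests are needed.
    print(list(follow_up_positions(90)))  # [51, 61, 71, 81]
```

For Rostock's 90 entries this gives four extra requests on top of the initial one.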

Long story short, you can parse all the results like this:

import requests
from bs4 import BeautifulSoup


def post_ajax_search(was: str, wo: str, pos: int):
    # POST the same form data that the 'Mehr anzeigen' button sends.
    req = requests.post("https://www.gelbeseiten.de/AjaxSuche", data={
        'umkreis': -1, 'WAS': was, 'WO': wo, 'position': pos, 'sortierung': 'relevanz'})
    r = req.json()
    return [r[key] for key in ("gesamtanzahlTreffer", "html", "anzahlTreffer")]


def parse_html(html: str) -> list[str]:
    # Requires the lxml package; "html.parser" works without it.
    soup = BeautifulSoup(html, "lxml")
    return [i.text for i in soup.find_all("h2", {"data-wipe-name": "Titel"})]


def parser(was: str, wo: str) -> list[str]:
    # The first request (position 0) returns the first 50 results.
    total_treffer, html, parsed_treffer = post_ajax_search(was, wo, 0)
    all_items = parse_html(html)
    i = 0
    # Each follow-up request returns the next 10 results, starting at position 51.
    while parsed_treffer < total_treffer:
        _, html, treffer = post_ajax_search(was, wo, 51 + i)
        if not treffer:
            break  # guard: stop rather than loop forever if nothing more comes back
        all_items += parse_html(html)
        parsed_treffer += treffer
        i += 10
    return all_items


for praxis in parser("Physiotherapie praxis", "rostock"):
    print(praxis)

Output:

Göllner Sabine Krankengymnastik & Physiotherapie
Friemel Physiotherapie Inh. B. Neumann Krankengymnastik & Physiotherapie
Nehrenberg Dorothee Physiotherapie
Physiotherapiezentrum Marcel Frank
Silke Thiede Physiotherapie
Physiotherapie Kollmorgen
Buller Olaf Physiotherapie
Gemeinschaftspraxis Physiotherapie Möller & Norden
Physiotherapie Annekathrin Hinz
Physiotherapie Hinz Annekathrin Praxis für Physiotherapie
Physiotherapie K. Schuldt
Physiotherapie Richter Ralf-Uwe Physiotherapie
Sport-Physio Rostock, Inh. Tschiersch, Daniel Physiotherapie
Klimt Dagmar Physiotherapie
MedPrevio
Pause Andrea Physiotherapiepraxis
Sörgel Steffen
Doremans Monika Physiotherapie
Doremans Monika Physiotherapie
Friemel B. Physiotherapie
Physiotherapie Vital Speicher Katja Oestreich
Jürß Katherina Physiotherapie
Pietralczyk Regina Physiotherapie
Stoll Sven Physiotherapie
Tübbecke Carola Physiotherapie
Physiotherapie Reiser u. Behrens
Physiotherapeutische Praxis Rose
Arndt K. Physiotherapie
Arndt K. Physiotherapie
Hieke Gunnar Praxis für Physiotherapie
PTB Physiopraxis
PTB Physiopraxis
Physiotherapie Rhea Brüdigam
Duske Sandra
Achsnig Marion Physiotherapie
Berthold Physiopraxis
Bohn Katharina Praxis für Physiotherapie
Erdmann L. Physiotherapie
Hennig Heidlinde Physiotherapie
Klatt Gabriele Physiotherapie
Physio- & Hydrotherapie Evelyn Ruß-Deuschle
Physiometik-Physiotherapie und Kosmetik
PhysioPlus Martin Berthold
Physiotherapie Elke Wegener
Physiotherapie Inh. Doreen Bastian
Therapiewelten Fromm Inh. Andrea Fromm Physiotherapie
Therapiewelten Fromm Inh. Andrea Fromm Physiotherapie
Therapiewelten Fromm Inh. Andrea Fromm Physiotherapie
vital & physio GmbH Portwich, Rene & Kristina
Neumann Andre Physiotherapie
Physiotherapie Heike Braun u.Gisela Wessel-Schutz
Physiotherapie Monika Laasch
Physiotherapiepraxis Briese Inke u. Engel Katrin
Schawaller, Mertens Physiotherapie
Ahrens Ch. Hoffmann B. Kautz K. Wiechert M. Physiotherapiepraxis
Lenz Andrea Praxis für Physiotherapie
PhysioKiDa
Physiotherapie Birgit Paul
Physiotherapie Hirsch U.
Maaß Ingrid Physiotherapie
Physiotherapie Birgit Vogt
Müller Holger Physiotherapie
Physiotherapie A. Fischer-Pifrement
Physiotherapie Schuberth Simone
Skupin Anne, Praxis für Physiotherapie und Kinderphysiotherapie
Stoll Sven Physiotherapie
Physiotherapiepraxis Lasch
Physiotherapie Leyer
Simon Petra Physiotherapie
Erdmann Petra Physiotherapeutische Praxis
Doremans-Harms Monika Physiotherapie
Holz-Gräfe Ulrike Physiotherapie
Kannenberg u. Swensson Praxisgemeinschaft für Physiotherapie
Keßler Dirk Physiotherapie
Physiotherapie Ahrens Ch., Hoffmann B., Kautz K. u. Wiechert M.
Physiotherapie Dorit Schumacher Praxis für Physiotherapie
Physiotherapie Höhnerbach
Physiotherapie Kerstin Wikert Physiotherapeutin
Physiotherapie Kollmorgen
Physiotherapie Neumann
Physiotherapie Physikalische Therapie Inh. Karin Hellmuth
Physiotherapiepraxis Angela Keller
Pöschmann Kathleen Menschen"s"kinder Physiotherapie
PTB Physiopraxis
Roberto Kollmorgen
Rothkirch Physiotherapie Ramona
Schmidt Josephine Praxis für Physiotherapie
Stoll Sven Physiotherapie
Strauß Arne
Thoms Christiane Physiotherapie
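
The output above contains some names more than once (for example "PTB Physiopraxis"). These may well be separate listings, such as several branches, so whether to drop them is a judgment call. If only distinct names are wanted, a small sketch like the following would remove repeats while keeping the original order. The helper name unique_in_order is made up for illustration.

```python
from typing import Iterable, List


def unique_in_order(names: Iterable[str]) -> List[str]:
    """Drop repeated names while keeping first-seen order.

    dict preserves insertion order (Python 3.7+), so dict.fromkeys
    acts as an ordered set here.
    """
    return list(dict.fromkeys(name.strip() for name in names))


if __name__ == "__main__":
    sample = ["PTB Physiopraxis", "Duske Sandra", "PTB Physiopraxis"]
    print(unique_in_order(sample))  # ['PTB Physiopraxis', 'Duske Sandra']
```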
joni

BeautifulSoup is an HTML parser.

If you need to click buttons on an HTML page, use a tool that drives a real browser, like Selenium.

If you don't want to learn Selenium, a hacky solution is to download the HTML after clicking the Mehr Anzeigen button and then parse it using BeautifulSoup. Here's a paste of the HTML after all 90 entries are displayed: https://pastebin.pl/view/raw/277d9ea1

Ashok Arora
  • I want to do this for multiple cities. I don't want to manually download the HTML page for many cities after clicking the "show more" button multiple times. Some big cities have more than 1000 entries! – user16093299 Jun 03 '21 at 11:20
  • Oh, then using Selenium is the right way to automate button clicks. – Ashok Arora Jun 03 '21 at 11:21
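
For the multi-city case raised in the comments, one possible shape is a loop that calls a per-city fetch function and pauses between cities. This is only a sketch: scrape_cities, fetch and delay_s are hypothetical names, and fetch stands for any function that takes a search term and a city and returns a list of names (for instance the parser function from the other answer, or a Selenium-based equivalent).

```python
import time
from typing import Callable, Dict, Iterable, List


def scrape_cities(cities: Iterable[str],
                  fetch: Callable[[str, str], List[str]],
                  was: str = "Physiotherapie praxis",
                  delay_s: float = 1.0) -> Dict[str, List[str]]:
    """Collect names per city, pausing between cities to go easy on the server.

    `fetch(was, city)` is supplied by the caller; this function does no
    network access itself.
    """
    results: Dict[str, List[str]] = {}
    for index, city in enumerate(cities):
        if index:  # no pause before the first city
            time.sleep(delay_s)
        results[city] = fetch(was, city)
    return results
```

Usage would look like scrape_cities(["rostock", "berlin"], parser), assuming parser is defined as in the other answer.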