
I am scraping this URL, which shows a newspaper image with highlighted words. My goal is to retrieve all the words highlighted in red. Inspecting the page shows that they sit in elements with the class image-overlay hit-rect ng-star-inserted, from which the title attribute must be extracted:

Using the following code snippet with BeautifulSoup:

from bs4 import BeautifulSoup
pg_snippet_highlighted_words = soup.find_all("div", class_="image-overlay hit-rect ng-star-inserted")
print(pg_snippet_highlighted_words) # returns nothing: []
print(pg_snippet_highlighted_words.get("title")) # AttributeError: ("'NoneType' object has no attribute 'get'",) when soup.find() is executed!

However, I get [] as a result!

My expected result is a list of length 17 in this specific example, containing all the highlighted words on this page, i.e., the values of the title attribute shown in the inspector:

EXPECTED_RESULT = ["Katri", "Katrina", "Katri", "Katri", "Katri", "Katri", "Katri", "Katri", "Ikonen.", "Katrina", "Katri", "Ikonen.", "Katri", "Katrina", "Katri", "Katri", "Katri"]

Is BeautifulSoup the right tool for extracting information from dynamic content?

Cheers,

Farid Alijani
  • Just FYI, when you deal with dynamic websites that use JS to load content (CSR websites, for example), you should consider browser automation tools such as Playwright or Puppeteer. – Lahcen YAMOUN Jan 23 '23 at 17:28
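
To illustrate the point about client-side rendering, here is a minimal sketch that uses only the standard-library html.parser as a dependency-free stand-in for BeautifulSoup. Both HTML strings are hypothetical: the first imitates what a plain HTTP client might receive, the second imitates the DOM after the browser has run the page's scripts. The names HitTitleParser and hit_titles are made up for this example.

```python
from html.parser import HTMLParser


class HitTitleParser(HTMLParser):
    """Collect title attributes of <div> elements carrying the hit-rect class."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # Attribute values can be None for valueless attributes.
        classes = (attr.get("class") or "").split()
        if tag == "div" and "hit-rect" in classes and attr.get("title"):
            self.titles.append(attr["title"])


def hit_titles(html):
    parser = HitTitleParser()
    parser.feed(html)
    return parser.titles


# Hypothetical markup: what a plain HTTP client might receive ...
static_html = (
    "<html><body><app-root></app-root>"
    "<script src='main.js'></script></body></html>"
)
# ... versus the DOM after the browser has executed the scripts.
rendered_html = (
    "<html><body><app-root>"
    '<div class="image-overlay hit-rect ng-star-inserted" title="Katri"></div>'
    '<div class="image-overlay hit-rect ng-star-inserted" title="Ikonen."></div>'
    "</app-root></body></html>"
)

print(hit_titles(static_html))    # []
print(hit_titles(rendered_html))  # ['Katri', 'Ikonen.']
```

Any parser that only sees the first kind of document will return an empty list, whichever selector is used.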

1 Answer


The data you're looking for is loaded from an external URL via JavaScript. To get the data, you can use the following example:

import requests

api_url = "https://digi.kansalliskirjasto.fi/rest/binding-search/ocr-hits/761979"
params = {"page": "12", "term": ["Katri", "Katrina", "Ikonen"]}

data = [d["text"] for d in requests.get(api_url, params=params).json()]
print(data)

Prints:

['Katri', 'Katrina', 'Katri', 'Katri', 'Katri', 'Katri', 'Katri', 'Katri', 'Ikonen.', 'Katrina', 'Katri', 'Ikonen.', 'Katri', 'Katrina', 'Katri', 'Katri', 'Katri']
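
For reference, the list passed under term is sent as a repeated query parameter. The sketch below shows the resulting URL using only the standard library. It assumes, as the code above implies, that the endpoint returns a JSON list of objects with a text key. The helper names build_url, extract_texts and fetch_hits are illustrative, not part of any API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "https://digi.kansalliskirjasto.fi/rest/binding-search/ocr-hits/761979"


def build_url(base, params):
    # doseq=True expands list values into repeated keys (term=...&term=...),
    # which is also how requests encodes a list passed via params=.
    return f"{base}?{urlencode(params, doseq=True)}"


def extract_texts(hits):
    # Assumes each hit is a dict with a "text" key, as in the answer's code.
    return [hit["text"] for hit in hits]


def fetch_hits(params, timeout=10):
    # Performs the network request; urlopen raises on HTTP errors or timeouts.
    with urlopen(build_url(API_URL, params), timeout=timeout) as resp:
        return extract_texts(json.load(resp))


# No network access needed to inspect the URL that would be requested.
print(build_url(API_URL, {"page": "12", "term": ["Katri", "Katrina", "Ikonen"]}))
```

Calling fetch_hits({"page": "12", "term": ["Katri", "Katrina", "Ikonen"]}) would then perform the actual request; the timeout keeps a stalled connection from hanging the script.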
Andrej Kesely
  • Thank you so much! May I ask how you found `api_url`? I would like to know whether there is documentation for this REST API that I could use in a later application. – Farid Alijani Jan 23 '23 at 17:46
    @FaridAlijani When you open the Firefox developer tools (or the equivalent in Chrome) -> Network tab and reload the page, you will see all the requests the page makes. One of these requests is this API call. – Andrej Kesely Jan 23 '23 at 19:10
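
The Network-tab search described above can also be done offline, since both Firefox and Chrome can export the recorded requests as a HAR file (a JSON document). A rough sketch that lists the requests whose response is JSON follows. The file name page.har and the helper json_request_urls are hypothetical, and the sample entries are made up to mirror the HAR layout.

```python
import json


def json_request_urls(har):
    """Return the URLs of recorded requests whose response is JSON."""
    urls = []
    for entry in har["log"]["entries"]:
        mime = entry["response"]["content"].get("mimeType", "")
        if "json" in mime.lower():
            urls.append(entry["request"]["url"])
    return urls


# Made-up sample in HAR layout; a real export could be loaded with
# json.load(open("page.har", encoding="utf-8")).
sample_har = {
    "log": {
        "entries": [
            {
                "request": {"url": "https://example.org/index.html"},
                "response": {"content": {"mimeType": "text/html"}},
            },
            {
                "request": {"url": "https://example.org/rest/hits?page=12"},
                "response": {"content": {"mimeType": "application/json"}},
            },
        ]
    }
}

print(json_request_urls(sample_har))
```

Filtering on the response MIME type is only a heuristic; some APIs serve JSON under a different content type, so the full request list may still need a manual look.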
  • May I also kindly ask you to have a look at another question of mine, https://stackoverflow.com/q/74208495/5437090, about scraping a similar page with dynamic content? I tried to follow your approach and look for an API in the Chrome developer tools --> Network --> Request URL to retrieve the information faster and more efficiently, but all I get in the Preview and Response tabs is: `Please turn on JavaScript in order to use the application.` – Farid Alijani Jan 25 '23 at 08:29