Scrape HTML page that has text embedded in stylesheet and WOFF file

Question

I want to scrape a webpage, but some of the data is embedded in the stylesheet and WOFF font files.

Here are the links. From https://777codes.com/newtestament/mat1.html I want the Greek text, which does not show at all in Chrome's inspector.

From https://777codes.com/newtestament/gen1.html I want the Hebrew text, but in Chrome's inspector some characters appear as "???", and those come out in the scrape as well.

Basically, Chrome's element inspector shows blanks or question marks, but the text displays correctly in the browser, so I know the data is there.

The missing data is in Greek and Hebrew.

I tried some basic scrapes with Beautiful Soup and very simple Selenium. They return what the element inspector shows, which is incorrect. I want to get what I see in the browser.

I understand that sometimes JavaScript renders content, but I think this is a bit different.

Welcome to Stack Overflow! Can you provide the URL so we can test our code on it? – Jurakin, Feb 08 '23 at 15:41
Yes, of course. I plan to scrape data off webpages I am generating myself. I will spare you the painful details of why and how, but getting the Greek text out of the HTML is the last and most important part! I have uploaded a sample page and provided a link in the original question. – ShaneO, Feb 08 '23 at 17:00
The site uses the `GJOUKN+koineISA` font to display Greek and another font to display Hebrew. The underlying text is a Latin-alphabet transcription, which the font renders as Greek or Hebrew glyphs. – Jurakin, Feb 08 '23 at 17:26
You need to use a script (or write one yourself) to convert the Latin characters to other Unicode characters, such as [transliterate](https://pypi.org/project/transliterate/). – Jurakin, Feb 08 '23 at 17:29
I don't understand. Do you have problems with transliteration of the alphabet? – Jurakin, Feb 08 '23 at 17:35
I just want to scrape the text. Right now, if I do a scrape in Beautiful Soup to pull from the relevant divs, I get the Hebrew text with some characters as question marks. I suspected I needed to build a map of some kind but was not sure how or where to start. Thank you for the transliteration keyword. I will look that up. I would also really appreciate any other links to more info about what's going on here. – ShaneO, Feb 08 '23 at 17:44
With which tag do you see the question marks? I can't see any Hebrew text, and with Beautiful Soup `.text` seems to work fine. – Jurakin, Feb 08 '23 at 18:29
Sorry, I want the Hebrew text. In Chrome's element inspector some of the Hebrew words have question marks between the letters. This comes through the same way when I get the text for that div with BS, i.e. words with question marks in them. Do I have to scrape first and then run the result through a detransliterating script to clear the question mark characters, or do you see all the Hebrew correctly, which would mean I am using BS incorrectly? – ShaneO, Feb 08 '23 at 18:44
Can you please tell me which div you want to get? I see Greek text at the most. – Jurakin, Feb 08 '23 at 18:47
No, sorry, I posted two different links. Hebrew is at https://777codes.com/newtestament/gen1.html and Greek is at https://777codes.com/newtestament/mat1.html – ShaneO, Feb 08 '23 at 19:08

Jurakin · Accepted Answer · 2023-02-09T06:09:23.577


Actually, you don't need the transliterate library. I was able to extract the Hebrew characters from the site using Beautiful Soup.

import copy

import requests
from bs4 import BeautifulSoup

page = requests.get("https://777codes.com/newtestament/gen1.html")
soup = BeautifulSoup(page.content, "html.parser")

# div that holds the first Hebrew word
first_hebrew_word = soup.find("div", class_="stl_01 stl_21")

# outputs 1:1 יתꢀרא (including hebrew chars)
print(first_hebrew_word.text)

# if you want to clean the output

# work on a copy so the original soup stays intact
word = copy.copy(first_hebrew_word)
for garbage in word.find_all("span", class_="stl_22"):
    # remove garbage
    garbage.decompose()

# outputs יתꢀראꢁ (including hebrew chars)
print(word.text.strip())

# write as UTF-8 explicitly so the Hebrew characters survive on any platform
with open("output.txt", "w", encoding="utf-8") as file:
    file.write(word.text.strip() + "\n")

[Screenshot: output text in gedit (Ubuntu Linux)]

[Screenshot: zoomed output in Firefox (Ubuntu Linux)]

edited Feb 09 '23 at 06:09

answered Feb 08 '23 at 19:39

Jurakin


Unfortunately, I still get question marks in place of some Hebrew letters. Please see https://snipboard.io/u82AaT.jpg Does it work correctly for you? Do you get "בראשית"? I think the transliterate library has to be used. If you could just share a bit more info or links on that, I will work it out. I assume those question marks have unique values. Can I save them as they are and later work out how to get them to show correctly? – ShaneO Feb 08 '23 at 20:08
@ShaneO I think the VS Code terminal is not displaying them properly. See my edit, which saves the output to a file, and try opening it in a different text editor or in Firefox or Chrome. – Jurakin Feb 09 '23 at 06:04
The text returned should be "בראשית". The scrape is returning "יתꢀראꢁ", so it appears some characters are being rendered by fonts. Is there any way I could get a value from those missing letters so I can then build a map and replace them? How would I get those question marks converted to a Unicode value which I can look up? Would a data.encode("utf-8") work? I really appreciate your help with getting me this far. I think I should be able to find a way to make this work now. I just don't want to do something awkward when there is a simple solution. – ShaneO Feb 09 '23 at 07:29
@ShaneO Yes, you need to create a map that replaces the incorrect characters. The script returns what is on the web page, and the font renders it. I don't know a better way to do it than replacing them. – Jurakin Feb 09 '23 at 14:44
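A minimal sketch of the replacement map suggested in the comment above, using only the standard library. The two placeholder code points (`\ua880`, `\ua881`) and the Hebrew letters they map to are assumptions for illustration; the real pairs have to be checked against what the page's font actually draws. The helper names `fix_glyphs` and `unmapped_chars` are hypothetical.

```python
# Hypothetical mapping: both the placeholder code points and their Hebrew
# targets are assumptions; verify each pair against the page's font.
GLYPH_MAP = {
    "\ua880": "\u05e9",  # assumed: placeholder -> shin
    "\ua881": "\u05d1",  # assumed: placeholder -> bet
}

# str.maketrans turns the dict into a table keyed by code point.
TRANSLATION_TABLE = str.maketrans(GLYPH_MAP)


def fix_glyphs(text):
    """Replace every mapped placeholder character; leave the rest untouched."""
    return text.translate(TRANSLATION_TABLE)


def unmapped_chars(text):
    """Return non-ASCII, non-Hebrew characters that the map does not cover yet."""
    return {
        ch
        for ch in text
        if ord(ch) > 0x7F
        and not ("\u0590" <= ch <= "\u05ff")
        and ch not in GLYPH_MAP
    }
```

Running `unmapped_chars` over the whole scraped page first gives the list of characters that still need an entry in the map, which is likely easier than discovering them one word at a time.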
Thank you. How do I identify and distinguish the question mark characters? Is there a Python function that will give me the Unicode code points or something similar, which I can run the incomplete string through directly after getting it from the scrape? – ShaneO Feb 09 '23 at 14:50
@ShaneO See this [post](https://stackoverflow.com/a/7291199/14900791), which shows how to get the Unicode escape of a character. Use `char.encode("unicode_escape")` to encode; to decode, call `.decode("unicode_escape")` on the resulting bytes (a Python 3 `str` has no `decode` method). You can simply chain `.replace` calls or build a dict to map all replacements. – Jurakin Feb 09 '23 at 15:11
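To make the code-point lookup from the comment above concrete, here is a small standard-library sketch. The function names `describe_chars` and `escaped` are hypothetical; `ord`, `unicodedata.name`, and the `unicode_escape` codec are standard Python.

```python
import unicodedata


def describe_chars(text):
    """Return (char, 'U+XXXX', Unicode name) for each character in text."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


def escaped(text):
    """Return text with non-ASCII characters written as backslash-u escapes."""
    # str.encode returns bytes; decode them as ASCII to get a printable str.
    return text.encode("unicode_escape").decode("ascii")
```

Passing the scraped word to `describe_chars` lists each character with its code point, so the characters that display as question marks can be told apart and used as keys in a replacement dict.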

Yes, this works perfectly. However, because I am replacing LTR characters with RTL characters, the letters get jumbled. Is there any way to conserve the letter order and just replace? – ShaneO Feb 10 '23 at 13:46
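The thread does not resolve this last question, so the following is only a sketch of one way to investigate it. A character-by-character replacement does not change the stored order of a string; what changes is how a renderer lays it out, because each character carries a Unicode bidirectional class. `unicodedata.bidirectional` is standard Python; the helper names and the idea that the page might store the letters in visual order are assumptions.

```python
import unicodedata


def bidi_classes(text):
    """Return the Unicode bidirectional class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


def reverse_if_visual(text, stored_in_visual_order):
    """Reverse the string only if the source stored it in visual order.

    Whether that is the case is an assumption the caller has to verify,
    for example by comparing one mapped word with its known spelling.
    """
    return text[::-1] if stored_in_visual_order else text
```

Comparing `bidi_classes` before and after the replacement shows which characters switched from `"L"` to `"R"`. If a mapped word then reads backwards against its known spelling, reversing it may restore the order, though a plain reversal would misplace combining marks in pointed text.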
