I want to crawl data from this website.
I only need the text "Pictograph - A spoon 勺 with something 一 in it".

I checked Network -> Doc in the browser's developer tools, and I think the information is hidden there.

I think so because I found this line:
i.length > 0 && (r += '<span>&raquo;&nbsp;Formation:&nbsp;&nbsp;<\/span>' + i + _Eb)

I think this code generates the part of the page that is visible at the link.

However, I don't know what kind of code this is. It contains HTML, but it also contains many function() definitions.


Update
If the code is JavaScript, how can I crawl the website without using Selenium?

Thanks!

shihs
  • The page uses `JavaScript` to add this element, so you may need [Selenium](https://selenium-python.readthedocs.io/) to control a web browser that can run `JavaScript`. – furas Dec 06 '19 at 00:08
  • @furas But I think Selenium is not very efficient, so I'm wondering if there are other ways to crawl the website. – shihs Dec 06 '19 at 00:30
  • If you find the JavaScript code that adds it, you can try to write the same logic in Python. If the JavaScript loads data from some URL and you can find it, you can download it without Selenium. Selenium may be slower, but it is faster to build a solution with it. I already have code for this using Selenium. – furas Dec 06 '19 at 00:33
  • Check [Web-scraping JavaScript page with Python](https://stackoverflow.com/questions/8049520/web-scraping-javascript-page-with-python). It is old, but some of the information may still be useful. Many of the answers are based on the WebKit engine, which is/was used in web browsers. – furas Dec 06 '19 at 01:27
  • BTW: I got the cached file with the line you found, but it is very long and obfuscated, so it is impossible to understand. – furas Dec 06 '19 at 01:29

1 Answer

This page uses JavaScript to add this element. With Selenium I can get the HTML after the element has been added and then search for the text in it. The HTML has a strange structure: all the text is in one tag, so this part has no tag of its own to search for. But it is the last text in that tag and it starts with "Formation:", so I use BeautifulSoup to get all the text, including subtags, with get_text(), and then use split('Formation:') to get the text after that label.

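import selenium.webdriver
from bs4 import BeautifulSoup as BS

# Let a real browser run the page's JavaScript
driver = selenium.webdriver.Firefox()
driver.get('https://www.archchinese.com/chinese_english_dictionary.html?find=%E4%B8%8E')

# Parse the rendered HTML; naming the parser avoids a bs4 warning
soup = BS(driver.page_source, 'html.parser')
driver.quit()

# "Formation:" is the last label in the div, so keep what follows it
text = soup.find('div', {'id': "charDef"}).get_text()
text = text.split('Formation:')[-1]

print(text.strip())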

Selenium may run slower, but it was faster to write a solution with it.
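The split step can be tried on its own, without a browser. The sketch below is only an illustration: `text_after_marker` is a hypothetical helper, and the sample string is an assumption about roughly what get_text() might return, built from the entities visible in the JavaScript line quoted in the question. Unlike split('Formation:')[-1], it returns None when the label is missing instead of returning the whole text.

```python
import html


def text_after_marker(text, marker="Formation:"):
    """Return the text after the last occurrence of marker, or None."""
    # rpartition splits on the LAST occurrence, like split(marker)[-1],
    # but it also reports whether the marker was found at all.
    _, found, tail = text.rpartition(marker)
    if not found:
        return None
    # str.strip() also removes non-breaking spaces (\xa0) left by &nbsp;.
    return tail.strip()


# Hypothetical sample: an assumption about what get_text() might return,
# built from the entities seen in the page's JavaScript.
sample = html.unescape(
    "Some earlier text &raquo;&nbsp;Formation:&nbsp;&nbsp;"
    "Pictograph - A spoon 勺 with something 一 in it"
)

print(text_after_marker(sample))
print(text_after_marker("no marker here"))
```

In the Selenium code, the string returned by get_text() could be passed to this helper in place of the split and strip calls.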

If I could find the URL that the JavaScript uses to load the data, I would use it without Selenium, but I didn't see this information in the XHR responses. A few responses were compressed (probably gzip) or encoded, and the text may have been in them, but I didn't try to decompress or decode them.
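If one of those responses really is gzip, checking it could look like the sketch below. This is an assumption, not something verified against this site: `decode_body` is a hypothetical helper, the payload is made up for the demonstration, and the real responses may use a different compression or encoding entirely. HTTP clients such as requests normally undo gzip on their own when the server declares it in the Content-Encoding header, so a manual step like this only matters when the body itself was compressed by the application.

```python
import gzip

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def decode_body(raw, encoding="utf-8"):
    """Decompress raw bytes if they look like gzip, then decode to str."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    # errors="replace" keeps going even if the guessed encoding is wrong.
    return raw.decode(encoding, errors="replace")


# Made-up payload standing in for a saved response body.
payload = "Formation: Pictograph".encode("utf-8")

print(decode_body(gzip.compress(payload)))  # compressed input
print(decode_body(payload))                 # plain input passes through
```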

furas