
I am trying to scrape data from a website using Python. The problem is that there is no browser installed, and none can be installed (it is a pure Debian OS without a GUI). I thought it might be possible to use ChromeDriver and headless mode in Selenium, but that doesn't seem to work.

Here is my test code:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from webdriver_manager.chrome import ChromeDriverManager
  
options = Options()
options.headless = True
  
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
  
driver.get('https://www.kino-teatr.ru/')

search_bar = driver.find_element_by_id('search_input_top')  # find search bar
search_bar.send_keys('Avengers')  # enter the name of the movie
search_bar.send_keys(Keys.ENTER)  # get the results

page = driver.page_source
soup = BeautifulSoup(page, 'html.parser')

div = soup.find('div', class_='list_item')  # find the first item
print(div.find('a')['href'])  # find a link to the page

And it gives me the following error:

WebDriverException: Message: unknown error: cannot find Chrome binary
Stacktrace: (16 frames, all `<unknown>`, omitted)

I've already tried installing the driver as described here and installing additional libraries as described here, but with no success.

Is it possible to use Selenium without an installed browser, and what should I do to achieve that?

Thanks in advance for any help or advice!

Oleg Ivanytskyi

1 Answer


You can try installing the `requests` library and doing the following to get the required HTML page:

>>> import requests
>>> url = 'https://www.geeksforgeeks.org'
>>> response = requests.get(url).text
>>> '7 Alternative Career Paths For Software Engineers' in response
True

Then you can use LXML or BeautifulSoup to parse the page.
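For example, a minimal BeautifulSoup sketch. The markup below is a hypothetical, simplified stand-in for the real search-results page; only the class names (`list_item`, `list_item_name`) are taken from the question's code:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the real results page
html_doc = """
<div class="list_item">
  <div class="list_item_name">
    <h4><a href="/kino/movie/hollywood/12345/annot/">Avengers</a></h4>
  </div>
</div>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
div = soup.find('div', class_='list_item')  # first result block
link = div.find('a')['href']                # link to the movie page
print(link)
```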

UPDATE

import requests
from lxml import html

# The site expects Windows-1251-encoded form data, so encode the query explicitly
response = requests.post('https://www.kino-teatr.ru/search/',
                         data={'text': 'мстители'.encode('cp1251')}).content
doc = html.fromstring(response)
entries = doc.xpath('//div[@class="list_item_name"]/h4')  # result headings
first_movie = entries[0].text_content()  # title of the first result
JaSON
  • thanks for your answer! But, if I am not mistaken, `requests` does not support dynamic web pages. That is why I was trying to use `selenium` – Oleg Ivanytskyi Feb 14 '22 at 16:38
  • @OlegIvanytskyi What is your goal? do you want to get HTML doc or get specific (dynamic) data from page? – JaSON Feb 14 '22 at 16:41
  • The code above is just a test code. In the future, I want to find specific data on dynamic pages. I also need to use selenium's `send_keys` method – Oleg Ivanytskyi Feb 14 '22 at 16:45
  • @OlegIvanytskyi Almost everything you do on a web page (except executing JavaScript) is just sending HTTP requests. Even if you need to fill in a form and then press a submit button on the page, you can do the same with python-requests. So-called "dynamic data" comes from XHRs, which can be simulated by python-requests. So instead of strictly sticking to Selenium, describe your exact issue and I'm quite sure you can solve it with requests – JaSON Feb 14 '22 at 17:10
  • Ok, I edited the question with a better example. There is a website `kino-teatr.ru`, where I need to 1) enter a movie name into the search box, and 2) find the link to the first movie that appears in the results – Oleg Ivanytskyi Feb 14 '22 at 17:22
  • 1
    @OlegIvanytskyi You need to send a POST request, e.g. `response = requests.post('https://www.kino-teatr.ru/search/', data={'text':'RING'}).text` (pass the name to `'text'`). Then you can check `'После просмотра некой загадочной видеокассеты следует телефонный звонок и каждый просмотревший умирает. Жертве дается лишь одна неделя, а дальше следует неминуемая смерть.' in response` or `'Палач и жертва' in response` to see that your request returns the list of found movies. You can replace `.text` with `.content` to get the HTML doc and parse it with any HTML parser (like the suggested LXML or BeautifulSoup) – JaSON Feb 14 '22 at 17:38
  • Thanks a lot! But it does not seem to work with non-English movies:( For example `requests.post('https://www.kino-teatr.ru/search/', data={'text':'мстители'}).text` returns me a page saying that nothing was found. Is there maybe another parameter for language? And, by the way, where do I find what key-value pairs `data` parameter should contain? – Oleg Ivanytskyi Feb 14 '22 at 17:58
  • 1
    @OlegIvanytskyi seem to be encoding issue. Try `requests.post('https://www.kino-teatr.ru/search/', data={'text':'мстители'.encode('cp1251')}).text` – JaSON Feb 14 '22 at 18:19
  • Omg, it worked! Finally! Thank you so much! – Oleg Ivanytskyi Feb 14 '22 at 18:24
  • Just to understand it better, how do you know what key-value pairs should go to `data` parameter? – Oleg Ivanytskyi Feb 14 '22 at 18:25
  • 1
    @OlegIvanytskyi To see what the HTTP request looks like: in the browser press F12 -> switch to the Network tab -> do whatever you need on the page -> check the sent requests. If F12 doesn't work, right-click and select "Inspect (Q)". Also check the updated answer - I've added the LXML parsing part – JaSON Feb 14 '22 at 18:28
  • Got it, thanks! – Oleg Ivanytskyi Feb 14 '22 at 20:56
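The encoding issue discussed in the comments can be illustrated offline: `requests` sends form values UTF-8-encoded by default, while this site apparently expects Windows-1251 (cp1251). A small sketch comparing the two encodings of the query:

```python
query = 'мстители'

utf8_bytes = query.encode('utf-8')     # what requests would send by default
cp1251_bytes = query.encode('cp1251')  # what the site expects

print(len(utf8_bytes))    # 16 bytes: two per Cyrillic letter
print(cp1251_bytes)       # 8 bytes, one per letter
```

This is why `data={'text': 'мстители'.encode('cp1251')}` works: passing bytes tells `requests` to send them verbatim instead of re-encoding the string as UTF-8.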