I am using Python with requests or aiohttp to fetch a page, and BeautifulSoup4 to parse it. The server's HTML page uses a Jinja template, so when I fetch the page with requests or aiohttp, I get something like this:

<a href="/{{username}}" class=\'pr\'>

But if you open the page in a browser, the markup looks like this:

<a href="/gavrilka" class=\'pr\'>
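A quick way to confirm that a fetched response still contains unrendered placeholders is to search it for double-curly-brace tokens. The following is only a minimal sketch: the helper name `find_unrendered_placeholders` and the regular expression are illustrative assumptions, not part of requests, aiohttp, or BeautifulSoup.

```python
import re

# Matches double-curly-brace placeholders such as {{username}} or {{ user.name }}.
PLACEHOLDER_RE = re.compile(r"\{\{\s*([^{}]+?)\s*\}\}")


def find_unrendered_placeholders(html: str) -> list:
    """Return the names found inside any {{ ... }} placeholders left in the HTML."""
    return PLACEHOLDER_RE.findall(html)


raw = '<a href="/{{username}}" class="pr">'
rendered = '<a href="/gavrilka" class="pr">'

print(find_unrendered_placeholders(raw))       # ['username']
print(find_unrendered_placeholders(rendered))  # []
```

If the list is non-empty, the placeholders are most likely filled in by JavaScript in the browser, which a plain HTTP client does not execute.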

requests code:

import requests
url = 'MY URL'
headers = {"MY HEADERS"}  # placeholder: replace with a dict of real headers
payload = {}
response = requests.request("GET", url, headers=headers, data=payload)
print(response.text.encode('utf8'))

aiohttp code:

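import asyncio
import aiohttp

url = 'MY URL'
headers = {"MY HEADERS"}  # placeholder: replace with a dict of real headers

async def main():
    # The context manager closes the session on exit, so no explicit close() is needed.
    async with aiohttp.ClientSession() as session:
        async with session.get(url, headers=headers) as resp:
            data = await resp.text()
            print(data)

asyncio.run(main())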

What should I do to get the correct page text?

  • Why don't you just use BeautifulSoup for getting the page and parsing? – Oliver Hnat Nov 06 '20 at 12:38
  • You will need to let JavaScript render the website. See: https://stackoverflow.com/questions/8049520/web-scraping-javascript-page-with-python There are many different approaches to the problem. Some are explained in the linked stackoverflow question. I personally use [rendertron](https://github.com/GoogleChrome/rendertron). Heroku support was recently added there so it is easy [getting your own rendertron instance up and running.](https://dashboard.heroku.com/new?button-url=https://github.com/GoogleChrome/rendertron/tree/main&template=https://github.com/GoogleChrome/rendertron/tree/main) – Tin Nguyen Nov 06 '20 at 12:40
  • I should mention that I am using this as part of a Telegram bot, deployed on a Linux server (AWS). – Aram Gishyan Nov 07 '20 at 07:14

1 Answer

I used Selenium with PhantomJS, and now it works. The browser driver executes the page's JavaScript, so page_source contains the rendered markup.

from selenium import webdriver
from bs4 import BeautifulSoup

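url = "https://yourlink"

driver = webdriver.PhantomJS()  # headless browser that executes the page's JavaScript
driver.set_window_size(1024, 768)  # optional
driver.get(url)
page_source = driver.page_source  # HTML after rendering
driver.quit()  # shut down the browser process
soup = BeautifulSoup(page_source, 'lxml')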
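Note that PhantomJS is no longer maintained and `webdriver.PhantomJS` was removed in Selenium 4, so the snippet above only runs on older Selenium releases. A rough equivalent with headless Chrome might look like the sketch below. It assumes Selenium 4 and a Chrome or Chromium browser are installed on the server; the function name `fetch_rendered_html` is hypothetical.

```python
def fetch_rendered_html(url: str) -> str:
    """Load url in headless Chrome and return the HTML after JavaScript has run."""
    # Imported inside the function so this module can be loaded
    # even on a machine where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")             # run without a visible window
    options.add_argument("--window-size=1024,768")  # optional, mirrors the original

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always shut the browser down, even if loading the page fails.
        driver.quit()
```

The returned string can then be passed to BeautifulSoup as before, for example `BeautifulSoup(fetch_rendered_html("https://yourlink"), 'lxml')`.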