0

Is there any way to get all the visible text of a web page, without the HTML source code?

For example, the text you would get by opening the page in a browser and pressing Ctrl+A to select everything.

import requests

content = requests.get('any url')
print(content.text)

This prints the HTML source as text, but I only want the visible text, as with the Ctrl+A approach described above.

dejanualex
Sofi
  • Does this answer your question? [BeautifulSoup Grab Visible Webpage Text](https://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text) – Thomas Weller Feb 01 '21 at 11:42
  • If the site doesn't provide a way to fetch the text directly, your only option is to fetch the page the way you did and extract the text programmatically by parsing the page source. There are probably ways that involve rendering the page and copying the text from the rendered version, but that is the same thing with more steps and complications. – Kemp Feb 01 '21 at 11:42

2 Answers

1

Step 1: Get some HTML from a web page

Step 2: Use the Beautiful Soup package to parse the HTML (if you are new to Beautiful Soup, see https://pypi.org/project/beautifulsoup4/)

Step 3: List the elements whose text is not required (e.g. header, meta, script) and skip them when collecting the text

import requests
from bs4 import BeautifulSoup
url = 'https://www.zzz.com/yyy/'  # give any URL
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
text = soup.find_all(string=True)  # 'string' is the current name of the older 'text' argument
output = ''
blacklist = [
    '[document]',
    'noscript',
    'header',
    'html',
    'meta',
    'head', 
    'input',
    'script',
    # add more tag names here to exclude their text
]
for t in text:
    if t.parent.name not in blacklist:
        output += '{} '.format(t)

print(output)
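
The same idea can be tried offline on a small HTML string. The following is a minimal sketch, not the only way to do it: the function name `visible_text`, the set `SKIPPED_PARENTS`, and the sample markup are made up for illustration, and the list of skipped tags is an assumption that will need tuning per site. It also drops HTML comments and whitespace-only strings, which the loop above would keep.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical, incomplete set of parent tags whose text a reader never sees
SKIPPED_PARENTS = {'[document]', 'head', 'title', 'meta',
                   'script', 'style', 'noscript'}


def visible_text(html):
    """Return the text nodes of `html`, stripped and joined by single spaces."""
    soup = BeautifulSoup(html, 'html.parser')
    pieces = []
    for node in soup.find_all(string=True):
        # HTML comments are text nodes too, but they are not rendered
        if isinstance(node, Comment):
            continue
        if node.parent.name in SKIPPED_PARENTS:
            continue
        stripped = node.strip()
        if stripped:  # ignore whitespace-only strings between tags
            pieces.append(stripped)
    return ' '.join(pieces)


# Made-up sample page, used only to show the behaviour
sample = """<html><head><title>Demo</title><script>var x = 1;</script></head>
<body><h1>Hello</h1><!-- hidden note --><p>Visible <b>text</b> here.</p></body></html>"""

print(visible_text(sample))  # prints: Hello Visible text here.
```

Note that this filters on the direct parent tag only, so text nested deeper inside an excluded element (for example a link inside a header) is still returned.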
Dharman
Arun Soorya
0

This requires the beautifulsoup4 and lxml packages to be installed (for example with pip install beautifulsoup4 lxml); after that it works as-is.

from bs4 import BeautifulSoup
import requests

source = requests.get('your_url').text
text = BeautifulSoup(source, 'lxml').text
print(text)
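
Depending on the Beautiful Soup version, the plain `.text` attribute may include the contents of script and style elements, and it keeps the page's original whitespace. A hedged variant is sketched below: it removes those elements first and then calls `get_text()` with a separator. The function name `page_text` and the sample markup are illustrative assumptions, and the sketch uses the built-in 'html.parser' instead of 'lxml' only so that it runs without an extra install.

```python
from bs4 import BeautifulSoup


def page_text(html):
    """Return the page text with script/style content removed."""
    soup = BeautifulSoup(html, 'html.parser')
    # Drop elements whose content is never rendered as text
    for tag in soup.find_all(['script', 'style', 'noscript']):
        tag.decompose()
    # Strip each string and join the pieces with single spaces
    return soup.get_text(separator=' ', strip=True)


# Made-up sample page, used only to show the behaviour
sample = ('<html><head><style>p {color: red}</style></head>'
          '<body><p>First</p><p>Second <i>line</i></p>'
          '<script>alert(1)</script></body></html>')

print(page_text(sample))  # prints: First Second line
```

For a real page, pass `requests.get('your_url').text` to the function in place of `sample`.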
ph140