Given a list of URLs pointing to different news websites, I would like to scrape enough raw text from those articles to generate keywords with nltk (NLP). But every news website is structured and parsed differently — is there a way to get only the raw text?
- Possibly a duplicate of [BeautifulSoup Grab Visible Webpage Text](https://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text/1983219#1983219). – Julio Cezar Silva Aug 10 '19 at 18:27
- Possible duplicate of [BeautifulSoup Grab Visible Webpage Text](https://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text) – Nazim Kerimbekov Aug 11 '19 at 07:05
1 Answer
There are multiple ways to do it. For example, you can simply fetch a page with requests like below.
```python
import requests

url = "https://wasi0013.com/"
content = requests.get(url).content  # raw HTML bytes of the page
```
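The fetched content is still HTML, not raw article text. The comments above point to BeautifulSoup's `get_text()` for stripping markup; as a minimal dependency-free sketch of the same idea, the standard library's `html.parser` can collect visible text while skipping `<script>` and `<style>` blocks (the class and function names here are illustrative, not from the original answer):

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script>/<style>/<noscript> contents."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # >0 while inside a tag we want to ignore

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```

In practice `BeautifulSoup(content, "html.parser").get_text()` does the same job with less code and handles malformed HTML more robustly.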
Since you are scraping multiple news websites, you will likely hit pages that render their content with JavaScript. Such content can't be fetched with requests alone. For those pages, you can use Selenium with chromedriver or geckodriver to scrape the raw text.
```python
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
# Disable image loading; we only need the text.
prefs = {'profile.managed_default_content_settings.images': 2}
chrome_options.add_argument('--headless')
chrome_options.add_experimental_option("prefs", prefs)
driver = webdriver.Chrome(options=chrome_options)

url = "https://wasi0013.com"
driver.get(url)
raw_text = driver.page_source
driver.quit()
```
Note that in the above code I've disabled image loading, since we only need text; this makes pages load slightly faster. Check out the Selenium documentation for more details.
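Once you have the raw text, the keyword step from the question would normally use nltk (`nltk.word_tokenize` plus `nltk.corpus.stopwords` after the corresponding `nltk.download(...)` calls). As a self-contained sketch of the same frequency-based idea — with a tiny hand-picked stopword set standing in for nltk's corpus — keyword extraction can look like this:

```python
import re
from collections import Counter

# Illustrative stopword list; in practice use
# nltk.corpus.stopwords.words("english") after nltk.download("stopwords").
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
             "for", "on", "with", "that", "this", "are", "was", "as", "at"}


def top_keywords(text, n=5):
    # Crude lowercase tokenization; nltk.word_tokenize handles
    # punctuation and contractions more carefully.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words
                     if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(n)]
```

Because every site's boilerplate (navigation, footers) survives naive extraction, keyword quality improves a lot if you filter to the article body first.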

Wasi