
Given a list of URLs pointing to different news websites, I would like to scrape enough raw text from those articles to generate keywords with nltk (NLP). But every news website is structured and parsed differently. Is there a way to get only the raw text?

Van Gran
    Possibly a duplicate of [BeautifulSoup Grab Visible Webpage Text](https://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text/1983219#1983219). – Julio Cezar Silva Aug 10 '19 at 18:27
  • Possible duplicate of [BeautifulSoup Grab Visible Webpage Text](https://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text) – Nazim Kerimbekov Aug 11 '19 at 07:05

1 Answer


There are multiple ways to do it. For example, you can simply use requests as shown below.

import requests

url = "https://wasi0013.com/"
# .content returns the raw bytes of the response body;
# use .text instead if you want it decoded to a string
content = requests.get(url).content
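Fetching the page only gets you HTML; to strip it down to the visible text you would typically use a parser such as BeautifulSoup (`soup.get_text()`), as the linked duplicate suggests. As a rough sketch of the idea using only the standard library (the class name and sample HTML here are just for illustration):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text nodes, skipping <script> and <style> content."""
    def __init__(self):
        super().__init__()
        self._skip = 0       # depth inside script/style tags
        self._chunks = []    # collected text fragments

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks
        if not self._skip and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)

html = ("<html><head><style>p{color:red}</style></head>"
        "<body><p>Breaking news</p><script>var x=1;</script></body></html>")
parser = TextExtractor()
parser.feed(html)
print(parser.text())  # Breaking news
```

A dedicated library will handle malformed real-world HTML far more robustly, but this shows the principle: walk the markup and keep only the data nodes.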

Since you are trying to scrape multiple news websites, you might have to parse sites that use JavaScript to render their content. Content rendered with JS can't be fetched using requests alone. For those sites, you can use selenium with chromedriver or geckodriver to scrape the raw text.

from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
# Disable image loading; we only need the text
prefs = {
    'profile.managed_default_content_settings.images': 2,
}
chrome_options.add_argument('--headless')
chrome_options.add_experimental_option("prefs", prefs)
driver = webdriver.Chrome(options=chrome_options)

url = "https://wasi0013.com"
driver.get(url)
raw_text = driver.page_source
driver.quit()

Note that in the code above I've disabled images, since we only need text. This makes the pages load slightly faster. Check out the documentation for more details.
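Once you have the raw text, the keyword step you mention could be as simple as ranking word frequencies. nltk gives you proper tokenizers and stopword lists for this; the sketch below uses only the standard library, with a tiny stopword set invented for illustration:

```python
import re
from collections import Counter

# Tiny stopword list for illustration only;
# nltk.corpus.stopwords is far more complete.
STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "as", "for"}

def keywords(raw_text, n=5):
    """Return the n most frequent non-stopword tokens in raw_text."""
    tokens = re.findall(r"[a-z]+", raw_text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(n)]

sample = "The markets rallied as markets analysts said the rally may continue"
print(keywords(sample, 2))
```

For real articles you would want stemming or lemmatization (also available in nltk) so that "rally" and "rallied" count as one keyword.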

Wasi