How do I scrape the equivalent of highlighting the whole page in a browser (i.e. not the page source) and copy/pasting it into Notepad (i.e. no hyperlinks, just text)?
import scrapy


class TextOnlySpider(scrapy.Spider):
    name = "onepage"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ['https://en.wikipedia.org/wiki/Congee']

    def start_requests(self):
        urls = [
            'https://en.wikipedia.org/wiki/Congee'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # The line below also returns the contents of <script>/<style> nodes, etc.
        # I only want the TEXT equivalent, i.e. the same text one gets by
        # copying in the browser and pasting into Notepad/vim.
        bodyText = '\n'.join(response.xpath('//text()').extract())
        yield {
            'text': bodyText,  # TODO: only get the text equivalent of the rendered page (text seen by human eyes)
            'title': response.url,  # TODO: change to the page title
            'id': response.url,
        }
I want the text that a human reads on the rendered page, not the page source as in this answer:
Scrapy Body Text Only
Reason:
I'll take the text representation and the page URL and index them in Elasticsearch, so that it becomes a site-search solution. I don't want messy HTML/JS code in the index.
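
To illustrate the kind of output I am after, here is a minimal sketch of one possible approach. It uses only Python's standard `html.parser` and collects text nodes while skipping elements that a browser does not display as text. The names `VisibleTextParser`, `visible_text` and the `SKIPPED_TAGS` set are hypothetical and chosen for this example. It is only an approximation of copy/paste from a browser: it does not run JavaScript or apply CSS, so content hidden by `display: none` would still be included.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping elements a browser does not show as text."""

    # Assumption: these tags cover the non-visible content of interest.
    SKIPPED_TAGS = {"script", "style", "noscript", "head", "template"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside any skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            # Collapse runs of whitespace and drop empty text nodes.
            text = " ".join(data.split())
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def visible_text(html):
    """Return the text of an HTML string, one text node per line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()
```

In the spider above, this could be called from `parse` as `bodyText = visible_text(response.text)`, so that only the cleaned text is passed on for indexing.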