
How do I scrape the equivalent of: highlighting the whole page in a browser (i.e. not the page source) and copying/pasting it into Notepad (i.e. no hyperlinks, just text)?

    import scrapy

    class TextOnlySpider(scrapy.Spider):
        name = "onepage"
        allowed_domains = ["en.wikipedia.org"]
        start_urls = ['https://en.wikipedia.org/wiki/Congee']

        def parse(self, response):
            # The line below returns every text node, including scripts and styles.
            # I only want the TEXT equivalent, i.e. the same text content one gets
            # by copying in a browser and pasting into Notepad/vim.
            bodyText = '\n'.join(response.xpath('//text()').extract())
            yield {
                'text': bodyText,  # TODO: only get the TEXT of the rendered page (text seen by human eyes)
                'title': response.url,  # TODO: change to the page title
                'id': response.url,
            }

I want the text that humans read, not the page source as in this answer:
Scrapy Body Text Only

Reason:

I'll take the text representation and the page URL and index them in Elasticsearch, so it becomes a site-search solution. I don't want messy HTML/JS code in the index.
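One way to get closer to the browser's visible text without extra dependencies is to collect text nodes while skipping `<script>` and `<style>` subtrees. A minimal stdlib sketch, independent of Scrapy (the `VisibleTextParser` class and `visible_text` helper names are illustrative, not part of any library):

```python
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # >0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text outside script/style
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    return "\n".join(parser.chunks)
```

Inside the spider's `parse()` you could then call `visible_text(response.text)` instead of the `//text()` XPath. This still won't match a browser copy/paste exactly (no JavaScript rendering, no CSS-based hiding), but it removes the script and style noise.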

Espresso

1 Answer

The html2text module can convert HTML to plain text while removing the tags:

    import html2text

    # handle() converts the HTML to Markdown-formatted plain text
    converter = html2text.HTML2Text()
    bodyText = converter.handle(response.text)

If you also want the text that is rendered by JavaScript, you'll need a headless browser such as Splash to render the page first.

Wim Hermans
  • Thanks, this code snippet solves 50% of the problem. It gets rid of JS functions, but other links/markup remain. I hope I can avoid heavyweight components like Splash that can slow things down. Thanks very much. – Espresso May 05 '20 at 15:37