newspaper3k - get articles from HTML instead of URL

Question

I'm using newspaper3k inside a Scrapy parse method. I want to extract links, but I don't want to fetch the website again.

Is it possible to use this:

newspaper.build(..)

with plain HTML, so that I can then call .articles?

score 0 · Answer 1 · answered May 27 '22 at 11:10

I found this solution:

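import httpx

from newspaper import Article

async def get_article(url):
    # AsyncClient is an async context manager, so it needs "async with"
    async with httpx.AsyncClient() as client:
        response = await client.get(url)

    # Hand the already-downloaded HTML to newspaper instead of calling download()
    article = Article(url)
    article.set_html(response.text)
    article.parse()
    return article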
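
If the goal is only to collect the links from HTML you already have, here is a minimal sketch that does not use newspaper at all, only the standard library. It is not what newspaper.build(..) does internally (that also filters and categorizes URLs), and the names LinkCollector and extract_links are made up for illustration:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute href values from <a> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in html, without any network request."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

Inside a Scrapy callback this would presumably be called as extract_links(response.text, response.url), since Scrapy has already downloaded the page. Each URL could then be requested through Scrapy and its HTML passed to Article.set_html as above.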
