Is there a way to scrape the HTML from an arbitrary webpage and keep only the visible text?

Asked May 26 '20 at 02:29

My first idea was to use regular expressions:

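import re

# Each call below starts again from the raw html, so only the last assignment survives.
data = re.sub(r'[^0-9a-zA-Z\s.,]', '', string=html).lower()  # keep ASCII letters, digits, whitespace, '.' and ','
data = re.sub(r'<[^>]*>', '', string=html)  # strip tags
data = re.sub(r'[^ ㄱ-ㅣ가-힣]+', '', string=html)  # keep only Hangul and spaces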

However, digits can disappear from the result, and long runs of whitespace are left behind.

I would appreciate any recommendations if there is a better way.
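
For comparison, here is a rough sketch of how the three substitutions might be chained so that each step works on the previous result instead of the raw html. The helper name `strip_with_regex` is hypothetical, and the allowed character set (digits, ASCII letters, Hangul, whitespace, '.' and ',') is an assumption pieced together from the patterns above. Regular expressions are not a real HTML parser, so this will misbehave on unusual markup.

```python
import re

def strip_with_regex(html):
    """Rough regex-only cleanup; hypothetical helper, not a full HTML parser."""
    # Drop <script> and <style> blocks together with their contents.
    text = re.sub(r'(?is)<(script|style)\b.*?</\1\s*>', ' ', html)
    # Replace every remaining tag with a space so adjacent words stay separated.
    text = re.sub(r'<[^>]*>', ' ', text)
    # Keep digits, ASCII letters, Hangul, whitespace, '.' and ','.
    text = re.sub(r'[^0-9a-zA-Zㄱ-ㅣ가-힣\s.,]', '', text)
    # Collapse runs of whitespace into a single space.
    return re.sub(r'\s+', ' ', text).strip()

sample = '<p>가격  100원</p>\n\n<script>var x=1;</script><b>Hello, world.</b>'
print(strip_with_regex(sample))
```

Because digits are part of the allowed set and whitespace is collapsed at the end, both symptoms described above (missing numbers, long gaps) should go away for simple pages.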

asked May 26 '20 at 02:29

김경주

Can you better explain what you're attempting to achieve? Are you just wanting to scrape the contents of an HTML page and print the text, excluding the HTML? – PacketLoss May 26 '20 at 02:32
That's right. It is difficult to target specific tags because the structure is different for each page to be crawled. So I want to save only the text content after scraping the HTML. – 김경주 May 26 '20 at 02:42
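
Since the goal is tag-independent text extraction, one possible approach is to let an HTML parser walk the document and collect only the text nodes. The following is a minimal sketch using the standard library's `html.parser.HTMLParser`. The names `TextExtractor` and `html_to_text` are made up for illustration, and skipping only `script` and `style` is an assumption about what counts as "not visible".

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""
    SKIP = {'script', 'style'}

    def __init__(self):
        super().__init__()  # character references like &amp; are decoded by default
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element.
        if not self.skip_depth:
            self.parts.append(data)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the pieces, then collapse any run of whitespace to one space.
    return re.sub(r'\s+', ' ', ' '.join(parser.parts)).strip()

page = ('<html><head><style>p{color:red}</style></head><body>'
        '<p>가격 100원</p><script>var x = 1;</script>'
        '<p>Tom &amp; Jerry</p></body></html>')
print(html_to_text(page))
```

This keeps digits and any script (Hangul, Latin, and so on) because nothing is filtered by character class. Joining with a space can split a word that is broken up by inline tags, so treat it as a starting point. Third-party parsers such as BeautifulSoup offer a similar `get_text()` method if adding a dependency is acceptable.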

0 Answers