
My current approach uses regular expressions:

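import re
data = re.sub(r'[^0-9a-zA-Z\s.,]', '', string=html).lower()  # keep ASCII letters, digits, whitespace, '.' and ','
data = re.sub(r'<[^>]*>', '', string=html)  # strip HTML tags
data = re.sub(r'[^ ㄱ-ㅣ가-힣]+', '', string=html)  # keep only spaces and Hangul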

However, with this approach numbers can disappear from the output, and the runs of whitespace left behind can be very long.
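
For reference, here is a minimal sketch of how the substitutions could be chained so that each step works on the previous result, digits survive, and whitespace is collapsed. `extract_text` is a hypothetical helper name, and the set of characters to keep (digits, ASCII letters, Hangul, whitespace, `.` and `,`) is an assumption about what counts as content.

```python
import re

def extract_text(html):
    """Hypothetical helper: strip tags, keep a whitelist of characters, collapse whitespace."""
    # Replace each tag with a space so words in adjacent elements do not fuse together.
    text = re.sub(r'<[^>]*>', ' ', html)
    # Keep digits, ASCII letters, Hangul, whitespace, '.' and ','; drop everything else.
    text = re.sub(r'[^0-9a-zA-Z\s.,ㄱ-ㅣ가-힣]', '', text)
    # Collapse every run of whitespace into a single space, then lowercase.
    return re.sub(r'\s+', ' ', text).strip().lower()

print(extract_text('<div><p>가격은   1500원</p>\n\n<b>Hello, World!</b></div>'))
# prints: 가격은 1500원 hello, world
```

This sketch does not remove the bodies of `<script>` or `<style>` elements and does not decode HTML entities, so a real HTML parser (for example the standard library's `html.parser`) is likely to be more robust on arbitrary pages.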

I would appreciate any recommendations if there is a better way.

김경주
  • Can you better explain what you're attempting to achieve? Do you just want to scrape the contents of an HTML page and print the text, excluding the HTML? – PacketLoss May 26 '20 at 02:32
  • That's right. The structure is different for each page being crawled, so it is difficult to target specific tags. I just want to save the text content after scraping the HTML. – 김경주 May 26 '20 at 02:42

0 Answers