import re
data = re.sub(r'<[^>]*>', '', html).lower()

I want to crawl arbitrary pages. Since I cannot scrape only the content I want from each page, I am asking here: is it valid to strip the HTML tags with a regular expression after scraping the page?

Andrej Kesely
김경주
  • Does this answer your question? [Can you provide some examples of why it is hard to parse XML and HTML with a regex?](https://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-reg) – Patrick Mevzek May 25 '20 at 19:44

1 Answer


The html2text library or the pextract library is a suitable way to do this.
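To illustrate why a dedicated HTML-to-text step tends to be safer than the regular expression in the question, here is a minimal sketch that uses Python's standard-library `html.parser` rather than either library named above. The names `TextExtractor`, `extract_text`, `regex_strip`, and `sample` are hypothetical and exist only for this example. It assumes that the contents of `<script>` and `<style>` elements are not wanted in the output.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the bodies of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; in text nodes.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)


def extract_text(html):
    """Return lowercased text with whitespace runs collapsed (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split()).lower()


def regex_strip(html):
    """The approach from the question: delete anything that looks like a tag."""
    return re.sub(r"<[^>]*>", "", html).lower()


# Hypothetical sample page, used only to compare the two approaches.
sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><p>Hello &amp; <b>World</b></p>"
    "<script>var x = 1;</script></body></html>"
)

print(regex_strip(sample))
print(extract_text(sample))
```

On this sample the regex version leaves the CSS and JavaScript in the result and keeps `&amp;` undecoded, while the parser-based version returns only the readable text. A library built for the job is likely to handle more cases than this sketch does.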

김경주