Is it possible to clean crawled web pages by stripping the tags? Can it be done with a regular expression?

Question

data = re.sub('<[^>]*>', '', string=html).lower()

I want to crawl random pages. Since I cannot scrape only the desired content from arbitrary pages, I am asking here: is it valid to remove the HTML with a regular expression after scraping it?
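To make the limits of the regular expression concrete, here is a minimal sketch that wraps the same pattern in a function and runs it on two small inputs. The function name `strip_tags_regex` and both sample strings are made up for illustration; they are not from the question.

```python
import re


def strip_tags_regex(html: str) -> str:
    """Remove anything that looks like <...> and lowercase the rest."""
    return re.sub(r'<[^>]*>', '', html).lower()


# Hypothetical inputs, chosen to show two typical failure modes.
SAMPLE = '<p>Hello <b>World</b></p><script>var x = 1;</script>'
ATTR_SAMPLE = '<a title="a > b">link</a>'

# The tags are gone, but the JavaScript source is kept as if it were page text.
print(repr(strip_tags_regex(SAMPLE)))       # 'hello worldvar x = 1;'

# A '>' inside a quoted attribute ends the match early and leaks markup.
print(repr(strip_tags_regex(ATTR_SAMPLE)))  # ' b">link'
```

So the pattern works on simple markup, but script or style contents and unusual attributes leave noise in the output.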

Does this answer your question? [Can you provide some examples of why it is hard to parse XML and HTML with a regex?](https://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-reg) — Patrick Mevzek, May 25 '20 at 19:44

score 0 · Answer 1 · answered May 29 '20 at 13:42


The html2text library or the pextract library are valid options for this.
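As a rough illustration of the parser-based approach, here is a sketch that uses only the standard library's `html.parser` as a dependency-free stand-in. It is not how html2text or pextract work internally, and the names `TextExtractor` and `html_to_text` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Join chunks with a space, then collapse runs of whitespace.
        # Assumption: a space between chunks is acceptable, even though it
        # splits a word that is broken up by inline tags.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text().lower()


print(html_to_text('<p>Hello <b>World</b></p><script>var x = 1;</script>'))
```

Because a real parser tracks which element it is in, the script body is dropped instead of being mixed into the text. A dedicated library will likely handle more cases, such as links, lists, and malformed markup.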


김경주

