How to split an HTML page with Python regular expressions

Question

Friends, I download an HTML page from Python (in a Jupyter notebook) in the following way:

import urllib.request
with urllib.request.urlopen('http://python.org/') as response:
    html = response.read()

I am all set to work with this object. However, when I try to clean it with regular expressions it does not work:

import re
re.split(r'\W+', html)

The last command raises a TypeError:

cannot use a string pattern on a bytes-like object

What should I do?

Welcome to SO. I wouldn't use regexp here. This is a debate as old as time. Use an HTML parser instead, such as [beautifulsoup](https://www.crummy.com/software/BeautifulSoup/bs3/documentation.html) - Torxed, Oct 21 '18 at 16:12
Check out this post https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 for why parsing HTML with regex is a bad idea. Use Beautiful Soup. - Batman, Oct 21 '18 at 16:20
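
To illustrate the parser approach the comments recommend, here is a minimal sketch. It assumes the goal is to pull the visible text out of the page, and it uses the standard library's html.parser rather than Beautiful Soup so that it runs without extra installs. The TextCollector class and the sample string are hypothetical, made up for this example.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper that gathers the non-empty text nodes of a document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs.
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


# Made-up sample standing in for the decoded page source.
sample = "<html><body><h1>Welcome</h1><p>Python is <b>fun</b>.</p></body></html>"

collector = TextCollector()
collector.feed(sample)
collector.close()
print(collector.chunks)
```

The parser tracks tags for you, so the text comes out without tag names mixed in, which is what a plain \W+ split cannot guarantee.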

score 0 · Answer 1 · answered Oct 21 '18 at 16:07

response.read() returns bytes, not str, so decode it into a string before applying a string pattern:

html = response.read().decode('utf-8')

After that, you can use a str regex on html.
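
As a minimal sketch of that fix, the snippet below uses a hard-coded bytes literal in place of response.read() so it runs without network access; the sample markup is made up for illustration. It also shows a second option that is not part of the answer above: keeping the data as bytes and using a bytes pattern (rb'...'), which the re module accepts as well.

```python
import re

# Stand-in for response.read(): urlopen(...).read() returns bytes like this.
raw = b"<html><head><title>Welcome to Python.org</title></head></html>"

# Option 1: decode to str, then split with a str pattern.
text = raw.decode("utf-8")
words = [w for w in re.split(r"\W+", text) if w]  # drop empty strings at the edges

# Option 2: keep bytes and split with a bytes pattern instead.
byte_words = [w for w in re.split(rb"\W+", raw) if w]

print(words)
print(byte_words)
```

Decoding assumes the page really is UTF-8; if it is not, .decode raises UnicodeDecodeError, and the charset declared in the response headers is the safer choice. Either way, the pattern and the input must be the same type, both str or both bytes.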

seuling
