
Friends, I fetch an HTML page from Python (in a Jupyter notebook) in the following way:

import urllib.request
with urllib.request.urlopen('http://python.org/') as response:
    html = response.read()

I am all set to work with this object. However, when I try to clean it with regular expressions it does not work:

import re
re.split(r'\W+', html)

The last command raises a TypeError:

cannot use a string pattern on a bytes-like object

What should I do?

ChaosPredictor
  • Welcome to SO. I wouldn't use regexp here. This is a debate as old as time. Use an HTML parser instead, such as [beautifulsoup](https://www.crummy.com/software/BeautifulSoup/bs3/documentation.html) – Torxed Oct 21 '18 at 16:12
  • Check out this post https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 for why parsing HTML with regex is a bad idea. Use Beautiful Soup. – Batman Oct 21 '18 at 16:20
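
To illustrate the parser route the comments recommend, here is a minimal sketch that uses the standard library's html.parser rather than Beautiful Soup (a swapped-in choice so that nothing needs installing). The names TextExtractor and visible_text, and the sample markup, are made up for this example. Note that the parser expects a str, so bytes from response.read() would still need decoding first.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text found between tags, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []   # pieces of visible text, in document order
        self._skip = 0     # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style blocks.
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def visible_text(html_text):
    """Return the list of text chunks in html_text (a str, not bytes)."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.chunks


# Hypothetical sample markup standing in for a downloaded page.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Hello</h1><p>Python <b>rocks</b></p></body></html>"
)
print(visible_text(sample))
```

Beautiful Soup would likely make this shorter and more forgiving of malformed markup; the sketch only shows the general idea of letting a parser, not a regex, separate tags from text.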

1 Answer


response.read() returns bytes, not str, and a string pattern cannot be applied to bytes. Decode the bytes into a string first:

html = response.read().decode('utf-8')

After that you can use the regex on html.
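
As a rough sketch of the two ways around the error, the snippet below either decodes the bytes before applying the string pattern or keeps the data as bytes and uses a bytes pattern. The helper name split_words and the sample bytes are invented for illustration, the sample stands in for response.read() so the snippet runs without a network connection, and UTF-8 is an assumption about the page's encoding.

```python
import re


def split_words(raw, encoding="utf-8"):
    """Decode raw bytes to str, then split on runs of non-word characters."""
    # errors="replace" avoids an exception if the encoding guess is wrong.
    text = raw.decode(encoding, errors="replace")
    # re.split leaves empty strings at the edges; drop them.
    return [w for w in re.split(r"\W+", text) if w]


# Stand-in for response.read(); a real page would come from urlopen().
sample = b"<title>Welcome to Python.org</title>"

# Option 1: decode first, then use a str pattern.
words = split_words(sample)
print(words)

# Option 2: stay in bytes and use a bytes pattern (note the rb prefix).
byte_words = [w for w in re.split(rb"\W+", sample) if w]
print(byte_words)
```

If the page is not UTF-8, the charset reported by response.headers.get_content_charset() may be a better argument to decode() than a hard-coded value; it can return None when the server does not declare one.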

seuling