I am working on a small project and want to open a URL. I tried this:

url = 'http://www.ygdy8.net/html/gndy/dyzz/index.html'
content = urllib.request.urlopen(url).read() 

pat = re.compile('<div class="title_all"><h1><front color=#008800>.*?</a>>   </front></h1></div>'+ '(.*?)<td height="25" align="center" bgcolor="#F4FAE2"> ',re.S)
txt = ''.join(pat.findall(content))

But this gives me the error:

TypeError: can't use a string pattern on a bytes-like object

Then I tried:

txt = ''.join(pat.findall(content.decode()))

But that also raises an error:

    txt = ''.join(pat.findall(content.decode()))
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 251: invalid start byte

I looked for an answer, but I don't know how to solve it.

wangcs

1 Answer

The page's header declares charset=gb2312, which implies content.decode('gb2312', errors='ignore') should work:

>>> content.find(b'charset')
226
>>> content[226:226 + 20]
b'charset=gb2312">\r\n<t'
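Rather than hard-coding the encoding, the same lookup can be done in code. The following is only a minimal sketch: `sniff_charset` and `decode_page` are hypothetical helper names (not part of urllib or re), and the sample bytes stand in for what `urlopen(url).read()` would return.

```python
import re

def sniff_charset(raw, default="utf-8"):
    """Return the charset named near the top of an HTML page (hypothetical helper)."""
    # Search the raw bytes with a bytes pattern, since nothing is decoded yet.
    m = re.search(rb'charset=["\']?([A-Za-z0-9_-]+)', raw[:1024])
    return m.group(1).decode("ascii").lower() if m else default

def decode_page(raw):
    """Decode page bytes using the declared charset, skipping undecodable bytes."""
    return raw.decode(sniff_charset(raw), errors="ignore")

# Stand-in for the downloaded bytes (assumption: the real page declares gb2312 the same way).
sample = (
    '<meta http-equiv="Content-Type" content="text/html; charset=gb2312">\r\n'
    '<title>电影</title>'
).encode("gb2312")

text = decode_page(sample)
print(sniff_charset(sample))
print(text)
```

Note that errors='ignore' silently drops bytes that are invalid in the declared encoding, so a few characters may go missing on pages that mix encodings.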

However, your regex certainly will NOT work. You wrote front where the HTML tag is font. Perhaps you wanted the following:

>>> pat = re.compile(r'<div class="title_all"><h1><font color=#008800>.*?</a>></font></h1></div>'+ r'(.*?)<td height="25" align="center" bgcolor="#F4FAE2"> ',re.S)

This captures the table markup between the two pieces, as far as I can tell.

>>> txt = ''.join(pat.findall(content.decode('gb2312',errors='ignore')))
>>> print(txt[:500])

<div class="co_content8">
<ul>

<td height="220" valign="top"> <table width="100%" border="0" cellspacing="0" cellpadding="0" class="tbspan" style="margin-top:6px">
<tr> 
<td height="1" colspan="2" background="/templets/img/dot_hor.gif"></td>
</tr>
<tr> 
<td width="5%" height="26" align="center"><img src="/templets/img/item.gif" width="18" height="17"></td>
<td height="26">
    <b>

        <a href="/html/gndy/dyzz/20160920/52002.html" class="ulink">2016年井柏然杨颖《微微一笑很倾城》HD国语中字</a>
    </b>
<
>>> pat.pattern
'<div class="title_all"><h1><font color=#008800>.*?</a>></font></h1></div>(.*?)<td height="25" align="center" bgcolor="#F4FAE2"> '
>>> 
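To see the effect of the font/front typo without touching the network, here is a small sketch run against a made-up, heavily simplified fragment. The `sample` string and the follow-up `ulink` regex are assumptions for illustration; the real page contains much more markup.

```python
import re

# Hypothetical, simplified fragment shaped like the real page.
sample = (
    '<div class="title_all"><h1><font color=#008800>'
    '<a href="/x">Latest</a>></font></h1></div>\n'
    '<ul>\n<a href="/html/a.html" class="ulink">Movie A</a>\n</ul>\n'
    '<td height="25" align="center" bgcolor="#F4FAE2"> pages'
)

# Corrected pattern; re.S lets '.' match the newlines inside the captured block.
fixed = re.compile(
    r'<div class="title_all"><h1><font color=#008800>.*?</a>></font></h1></div>'
    r'(.*?)<td height="25" align="center" bgcolor="#F4FAE2"> ',
    re.S,
)
# Same pattern with the typo from the question reintroduced.
typo = re.compile(fixed.pattern.replace("font", "front"), re.S)

body = "".join(fixed.findall(sample))
links = re.findall(r'<a href="([^"]+)" class="ulink">', body)

print(typo.findall(sample))   # the misspelled tag never matches
print(links)
```

For anything beyond a quick extraction like this, an HTML parser is usually less brittle than a regex tied to exact attribute order and spacing.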
juanpa.arrivillaga