Using urllib2 to crawl a website returns a response without tags such as html or body

Question

import urllib2

url = 'http://www.bilibili.com/video/av1669338'

user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"

headers={"User-Agent":user_agent}

request=urllib2.Request(url,headers=headers)

response=urllib2.urlopen(request)

text = response.read()

text[:100]

'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xcd}ys\x1bG\xb2\xe7\xdfV\xc4|\x87\x1exhRk\x81\xb8\x08\x10\x90E\xfa\x89\xb2f\x9f\xe3\xd9\xcf\x9e\x1dyb7\xec\tD\x03h\x90\x90p\t\x07)yf"D\xf9I&EI\xd4}\x91\xb6.\xeb\xb0e\x93\x94%Y\xbc$E\xccW\x194\x00\xfe\xe5\xaf\xf0~Y\xd5\xd5\xa8\xeeF\x83\xa7'

Looks like that URL is serving (binary) video content, not HTML. What were you expecting? — John Gordon, May 10 '17 at 02:05
I want to scrape the tag () and read its content value, but the response doesn't contain that tag, or even an html tag. I don't know what happened. — fan, May 10 '17 at 02:11
The response is gzip-encoded: `1F 8B` (`'\x1f\x8b'`) is the magic number that starts a gzip header. See https://en.wikipedia.org/wiki/List_of_file_signatures or https://tools.ietf.org/html/rfc1952#section-2.3.1 — jmunsch, Jun 05 '18 at 10:41
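To illustrate the gzip observation above, here is a minimal sketch that checks a body for the gzip magic number and decompresses it with the standard `zlib` module. The names `maybe_gunzip` and `_gzip_bytes` and the sample HTML are made up for this illustration; nothing is fetched from the network.

```python
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of a gzip stream (RFC 1952)


def maybe_gunzip(body):
    """Return body decompressed if it starts with the gzip magic number, else unchanged."""
    if body[:2] == GZIP_MAGIC:
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
        return zlib.decompress(body, 16 + zlib.MAX_WBITS)
    return body


def _gzip_bytes(raw):
    # Demo helper only (hypothetical): build a gzip stream in memory
    compressor = zlib.compressobj(9, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
    return compressor.compress(raw) + compressor.flush()


# Stand-in for what response.read() returned in the question
sample_html = b"<html><head><meta name='keywords' content='demo'></head><body></body></html>"
compressed = _gzip_bytes(sample_html)
restored = maybe_gunzip(compressed)
```

With the question's code, the same idea would presumably be applied as `text = maybe_gunzip(response.read())` before looking for any tags.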

score 1 · Answer 1 · answered May 10 '17 at 03:24

import requests
from bs4 import BeautifulSoup

def data():
    url = 'http://www.bilibili.com/video/av1669338'
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
    headers = {"User-Agent": user_agent}
    response = requests.get(url, headers=headers)

    data = response.content
    # Name the parser explicitly to avoid bs4's "no parser specified" warning
    _html = BeautifulSoup(data, "html.parser")
    _meta = _html.head.select('meta[name=keywords]')
    print(_meta[0]['content'])

score 0 · Accepted Answer · answered May 10 '17 at 02:29

Try this:

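import bs4
import requests

# requests decompresses the body when the server sends Content-Encoding: gzip
res = requests.get("http://www.bilibili.com/video/av1669338")
soup = bs4.BeautifulSoup(res.content, "lxml")
# Read the content attribute of the meta tag named "keywords"
result = soup.find("meta", attrs={"name": "keywords"}).get("content")
print(result)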

answered May 10 '17 at 02:29

Tiny.D


Thank you! Could you tell me why using bs4 solves this problem? – fan May 10 '17 at 02:33
@fan The difference is that here we use `requests` instead of `urllib2`, then pass the response content to bs4, which makes finding the element easy. For the differences between `requests` and `urllib2`, see http://stackoverflow.com/questions/2018026/what-are-the-differences-between-the-urllib-urllib2-and-requests-module – Tiny.D May 10 '17 at 03:01
Thanks! It's very helpful for me. Best wishes – fan May 10 '17 at 04:47
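On why the switch helps: `requests` decodes gzip-encoded bodies automatically, so the parser receives real HTML; bs4 itself only does the parsing. As a rough sketch of that point, the same keywords lookup can be done with the standard library parser once the bytes are decompressed. The class `MetaKeywords`, the function `extract_keywords`, and the sample markup are hypothetical and used only for illustration.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class MetaKeywords(HTMLParser):
    """Remember the content attribute of <meta name="keywords">."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.keywords = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "keywords":
            self.keywords = attrs.get("content")


def extract_keywords(html_text):
    parser = MetaKeywords()
    parser.feed(html_text)
    return parser.keywords


# Made-up page standing in for the decompressed response text
sample = '<html><head><meta name="keywords" content="anime,video"></head><body></body></html>'
keywords = extract_keywords(sample)
```

bs4 remains the more convenient choice for real pages; the sketch only shows that the parsing step does not depend on which library fetched the data.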
