Python can't read website HTML code

Question

I can't read the HTML of this website using urllib:

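import urllib
from bs4 import BeautifulSoup  # assuming BeautifulSoup 4

def tests(url):
    response = urllib.urlopen(url)  # Python 2 API
    soup = BeautifulSoup(response.read(), 'html.parser')
    # Collect every <a class="pin-link"> element.
    universities = soup.findAll('a', {'class': 'pin-link'})
    print universities

if __name__ == '__main__':
    tests("https://pinshape.com/shop?page=3&is-free=true&type=-streamable")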

Is it possible to read the page source?

It's not just plain HTML. There is JavaScript activating a sign-in box, which is harder to parse. – Kim Ryan, Aug 17 '15 at 00:00

score 0 · Answer 1 · answered Aug 17 '15 at 00:06

You could try using urllib.request. Here is a snippet from code I am using:

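import urllib.request

# Download the page and decode the bytes into text.
with urllib.request.urlopen('https://pinshape.com/shop?page=2') as f:
    data = f.read().decode('utf-8').replace('\n', '')

# Write the result to a file; "w" creates the file if it does not exist.
with open("TestFile.txt", "w", encoding="utf-8") as myfile:
    myfile.write(data)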

Iorek

urllib.request is for Python 3 and above. Is there an equivalent for Python 2.7? – user2647541 Aug 17 '15 at 00:12
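
On the Python 2.7 question: a minimal sketch, assuming the only difference that matters here is where urlopen lives (urllib2 on Python 2.7, urllib.request on Python 3). The helper name fetch_html and its encoding parameter are made up for illustration and are not part of the answer above.

```python
try:
    # Python 3
    from urllib.request import urlopen
except ImportError:
    # Python 2.7
    from urllib2 import urlopen


def fetch_html(url, encoding="utf-8"):
    """Download a page and return its body as text (hypothetical helper)."""
    response = urlopen(url)
    try:
        # read() returns raw bytes; decode them into a text string.
        return response.read().decode(encoding)
    finally:
        response.close()
```

Calling fetch_html("https://pinshape.com/shop?page=2") should then return the page source as text on either version, assuming the page really is UTF-8 encoded.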

score 0 · Answer 2 · edited May 23 '17 at 12:15

Instead of urllib, you can try the requests library, which is friendlier for beginners.

For example, with requests you can fetch your webpage like this:

>>> import requests
>>> r = requests.get("https://pinshape.com/shop?page=2")
>>> r.text
u'<!DOCTYPE html>\n<html class=\'no-js\' lang=\'en\'>\n<head>\n<meta charset=\'utf-8\'> ...

As a side note, BeautifulSoup is not particularly fast. According to other posts on the subject and my own experience, lxml is noticeably faster than BeautifulSoup. You can check the link below for an XPath tutorial.

W3School: XPath Tutorial
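
To make the lxml suggestion concrete, here is a rough sketch of pulling the pin-link anchors out with XPath. The SAMPLE markup is a hypothetical stand-in for the downloaded page (the real page structure may differ), and pin_links is an illustrative name, not something from lxml.

```python
from lxml import html

# Hypothetical stand-in for the page source; the real markup may differ.
SAMPLE = """
<html><body>
  <a class="pin-link" href="/items/1">First</a>
  <a class="other" href="/items/2">Second</a>
  <a class="pin-link featured" href="/items/3">Third</a>
</body></html>
"""


def pin_links(page_source):
    """Return the href of every <a> whose class list contains 'pin-link'."""
    tree = html.fromstring(page_source)
    # Pad the class attribute with spaces so 'pin-link' is matched as a
    # whole token, including elements that carry several classes.
    query = (
        '//a[contains(concat(" ", normalize-space(@class), " "), '
        '" pin-link ")]/@href'
    )
    return tree.xpath(query)


print(pin_links(SAMPLE))
```

With a real page you would pass r.text from the requests example instead of SAMPLE.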

Hope it helps

score 0 · Answer 3 · answered Aug 17 '15 at 01:18

The URL you're trying to access uses HTTPS (note the 'S'), so you need to establish a secure connection. HTTP and HTTPS requests are handled very differently.
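
A minimal sketch of making the secure connection explicit, assuming Python 3 and only the standard library. The helper names make_context and fetch_https are hypothetical. Note that urlopen already negotiates TLS on its own when given an https:// URL, so passing a context mainly makes the certificate checks visible.

```python
import ssl
from urllib.request import urlopen  # Python 3 only in this sketch


def make_context():
    """Return an SSL context that verifies certificates and hostnames."""
    return ssl.create_default_context()


def fetch_https(url, timeout=10):
    """Fetch an https:// URL over a verified TLS connection (hypothetical helper)."""
    with urlopen(url, timeout=timeout, context=make_context()) as response:
        # Decode leniently; the page's real encoding is assumed to be UTF-8.
        return response.read().decode("utf-8", errors="replace")
```

If the download works but the links are still missing, the HTTPS layer is probably not the cause.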

MAhsan
