I can't read the HTML of this website using urllib:

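# Python 2 code; the bs4 import is assumed, since the snippet omitted its imports
import urllib
from bs4 import BeautifulSoup

def tests(url):
    # Fetch the page and parse the returned HTML
    response = urllib.urlopen(url)
    soup = BeautifulSoup(response.read())
    # Collect every <a> tag that has the class "pin-link"
    universities = soup.findAll('a', {'class': 'pin-link'})
    print universities

if __name__ == '__main__':
    tests("https://pinshape.com/shop?page=3&is-free=true&type=-streamable")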
Is it possible to read the page source?
Puck
user2647541

3 Answers

You could try urllib.request (Python 3). Here is a snippet adapted from code I am using:

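import urllib.request

# Download the page and decode the bytes to text (UTF-8 assumed)
with urllib.request.urlopen('https://pinshape.com/shop?page=2') as f:
    data = f.read().decode('utf-8').replace('\n', '')

# Mode "w" creates the file if it does not exist yet
with open("TestFile.txt", "w", encoding="utf-8") as myfile:
    myfile.write(data)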
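
If the goal is to parse the page rather than save it, it may help to decode the response with the charset the server declares and to send a browser-like User-Agent header. The following is only a minimal sketch of that idea, not a tested fix for this site: `build_request` and `fetch_html` are hypothetical helper names, the User-Agent string is a placeholder, and whether pinshape.com actually rejects Python's default user agent is an assumption.

```python
import urllib.request

# Placeholder User-Agent; an assumption, not a value the site is known to require.
USER_AGENT = "Mozilla/5.0 (compatible; example-script)"


def build_request(url):
    """Wrap the URL in a Request that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url, timeout=10):
    """Download a page and return its body as text."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        # Use the charset from the Content-Type header, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch_html('https://pinshape.com/shop?page=2')` should then return a `str` that can be handed straight to a parser.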
Iorek

Instead of urllib, you could try the requests library, which is friendlier for beginners.

For example, with requests you can fetch your webpage like this:

>>> import requests
>>> r = requests.get("https://pinshape.com/shop?page=2")
>>> r.text
u'<!DOCTYPE html>\n<html class=\'no-js\' lang=\'en\'>\n<head>\n<meta charset=\'utf-8\'> ...

As a side note, BeautifulSoup can be slow on large pages. In my own experience, lxml is definitely faster than BeautifulSoup, and it lets you query the document with XPath.
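
If installing a third-party parser is not an option, the standard library can do the same class-based lookup as the `findAll` call in the question. This is a rough sketch using `html.parser` rather than BeautifulSoup or lxml. `PinLinkParser` and `extract_pin_links` are hypothetical names, and the sample markup is made up, since the real page structure isn't shown here.

```python
from html.parser import HTMLParser


class PinLinkParser(HTMLParser):
    """Collect the href of every <a> tag whose class list contains 'pin-link'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # The class attribute may hold several space-separated names.
        classes = (attributes.get("class") or "").split()
        if "pin-link" in classes and attributes.get("href"):
            self.links.append(attributes["href"])


def extract_pin_links(html):
    """Return the hrefs of all 'pin-link' anchors found in an HTML string."""
    parser = PinLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample markup, only for illustration.
sample = '<a class="pin-link big" href="/items/1">One</a><a href="/about">About</a>'
print(extract_pin_links(sample))  # prints ['/items/1']
```

For large or badly broken pages, lxml or BeautifulSoup will likely be more forgiving. The stdlib parser is mainly handy for quick checks.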

Hope it helps

Community
Eric

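The URL you're trying to access uses HTTPS (note the 'S'), so the client has to establish a secure TLS connection before it can request the page. HTTP and HTTPS connections are set up differently, so make sure your Python installation was built with SSL support.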
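
With Python 3's `urllib.request`, the secure connection is normally set up automatically as long as the `ssl` module is available. As a hedged sketch, the TLS context can also be passed explicitly, which makes the certificate checks visible. `make_context` and `open_https` are hypothetical helper names, and nothing here has been verified against the site in question.

```python
import ssl
import urllib.request


def make_context():
    """Create a TLS context that verifies certificates and host names."""
    return ssl.create_default_context()


def open_https(url, timeout=10):
    """Open an HTTPS URL with an explicit TLS context and return the raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout, context=make_context()) as response:
        return response.read()
```

If `import ssl` itself fails, the interpreter was built without SSL support, and no HTTPS URL will open until that is fixed.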

MAhsan