I wrote a simple Python script to view the page source of a website. The website is https://kissanime.to. I am using the following small piece of code:

    import urllib2
    url = 'https://kissanime.to'
    link = urllib2.urlopen(url)
    # read() returns the page source; printing `link` alone only shows the response object
    print link.read()

However, this does not work and fails with the following error message:

HTTP Error 403 : Forbidden

I looked for a solution to this problem in the community and came up with this:

    import urllib2
    url = 'https://kissanime.to'
    link1 = urllib2.Request(url, headers={'User-Agent': "Magic Browser"})
    link2 = urllib2.urlopen(link1)

However, this method also fails, and now I am getting the error:

HTTP Error 503 : Service Temporarily Unavailable

Is there any kind of workaround for this problem? I am new to Python's web-crawling features. Please help.

soumya dubey
  • My guess is the user agent is your problem - the site is blocking you. This is probably a duplicate of: http://stackoverflow.com/questions/28396036/python-3-4-urllib-request-error-http-403 – Jeff Jun 28 '16 at 14:20
  • So what would be the solution to the problem? – soumya dubey Jun 28 '16 at 14:22
  • It's in the link I provided. You have to specify a user agent that the site accepts. – Jeff Jun 28 '16 at 14:22
  • And that would be the name and version of the browser I am using? I am using Python 2.7.10. Sorry for asking very basic questions, but it is all very new to me. – soumya dubey Jun 28 '16 at 14:24
  • No problem, web scraping is a bit of art and a bit of science. There's no one right user agent, but you can easily find out your own by just googling "find my browsers user agent", then make that the user agent for your program. It's just a string that contains the information. – Jeff Jun 28 '16 at 14:28
  • I tried providing the user agent, yet it still shows `HTTP Error 503 : Service Temporarily Unavailable`. I am using Python 2.7.10; is that the problem? – soumya dubey Jun 28 '16 at 14:37
  • Looks like they're too clever in blocking automated access then. You might have better luck with the `requests` library, or `mechanize`. They have some more tools for identifying yourself as legit. – Jeff Jun 28 '16 at 14:39

1 Answer

I checked out the website: it makes you wait for 5 seconds while it does something before displaying any of its main content.

I used the `requests` module to get at this initial page that says "Wait 5 seconds":

    import requests

    r = requests.get("https://kissanime.to/")
    # Throws an Insecure Platform warning on certain versions of Python

    print r.content

However, depending on what exactly you wish to scrape, you can start by looking under the hood to understand how the site is built, and then devise a strategy for scraping the content you want.

I must say, having looked at the network calls the site makes, it is pretty stubborn: every call has tons of parameters and cookies embedded in it.

What specifically are you looking to scrape from this website?

Also, the server returns 503 when it serves the initial page that says "Wait 5 seconds..."
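That 503 is also why `urllib2` stopped with an exception while `requests` printed the page: `urllib2.urlopen` raises `HTTPError` for any 4xx/5xx status, whereas `requests.get` hands back the response and only raises if you call `raise_for_status()`. The `HTTPError` object still carries the status code and the body the server sent, so you can read the "Wait 5 seconds" page from it. Below is a minimal sketch of that idea; `fetch_page` is a hypothetical helper name, the default user agent string is just a placeholder, and the import fallback is only there so the same code runs on Python 2.7 and Python 3.

```python
try:
    # Python 2, as used in the question
    from urllib2 import Request, urlopen, HTTPError
except ImportError:
    # Python 3 moved the same names into urllib.request / urllib.error
    from urllib.request import Request, urlopen
    from urllib.error import HTTPError


def fetch_page(url, user_agent="Mozilla/5.0"):
    """Return (status_code, body_bytes), even for 4xx/5xx responses.

    `fetch_page` and the default user agent are illustrative assumptions,
    not something the site is known to accept.
    """
    request = Request(url, headers={"User-Agent": user_agent})
    try:
        response = urlopen(request)
    except HTTPError as err:
        # HTTPError doubles as a response object: it exposes the status
        # code and the body that came back with the error status.
        return err.code, err.read()
    try:
        return response.getcode(), response.read()
    finally:
        response.close()
```

Calling `status, body = fetch_page("https://kissanime.to/")` should then give you the status code and whatever HTML came with it instead of a traceback. It does not get you past the 5-second wait page itself.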

  • Thanks, it actually worked. BTW, I am trying to download the episodes of One Piece from that website. It felt really boring and time-consuming to download them all by hand, one after another. I have heard of people using Python to make their lives easier, so I also started learning in order to make a web crawler to download them. – soumya dubey Jun 28 '16 at 15:25
  • You might want to look at patterns in the video source urls, without having to go through the homepage. Just right-click on the video and you should see an option to see the video url. Can you post one of them here? – Pushkar Chintaluri Jun 28 '16 at 15:36
  • I did look them up, and the only thing that differs between the URLs is a signature substring. This is the download URL for the video: `https://redirector.googlevideo.com/videoplayback?requiressl=yes&id=f28b2929ddeb2426&itag=18&source=webdrive&ttl=transient&app=texmex&ip=2001:19f0:6000:9ad4:5400:ff:fe20:66ec&ipbits=32&expire=1467135046&sparams=requiressl,id,itag,source,ttl,ip,ipbits,expire&signature=3149A7DDE3B2359A8FABFAC38E7CC4ED3E8FFBF7.DDB44E95904F82ACDA3A48DCCE3BECE59FC0224&key=ck2&mm=30&mn=sn-a5m7zne6&ms=nxu&mt=1467120529&mv=m&nh=IgpwcjAyLmxheDAyKgkxMjcuMC4wLjE&pl=38` – soumya dubey Jun 28 '16 at 15:48
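One way to check which query parameters really change between two video URLs, rather than comparing them by eye, is to parse both query strings and diff them. This is only a sketch: `differing_params` is a hypothetical helper name, and it makes no claim about how the `signature` value is generated.

```python
try:
    from urlparse import urlsplit, parse_qs      # Python 2
except ImportError:
    from urllib.parse import urlsplit, parse_qs  # Python 3


def differing_params(url_a, url_b):
    """Return {name: (values_in_a, values_in_b)} for query parameters that differ.

    Each value is the list produced by parse_qs, or None when the
    parameter is missing from that URL.
    """
    params_a = parse_qs(urlsplit(url_a).query, keep_blank_values=True)
    params_b = parse_qs(urlsplit(url_b).query, keep_blank_values=True)
    diff = {}
    for name in set(params_a) | set(params_b):
        if params_a.get(name) != params_b.get(name):
            diff[name] = (params_a.get(name), params_b.get(name))
    return diff
```

Passing two episode download URLs to `differing_params` returns only the parameters whose values are not identical, so you can confirm whether `signature` is the only one that varies or whether others, such as `expire`, change too.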