Why can't I crawl this link in Python?

Question

I am trying to crawl the contents of a web page, but I don't understand why I am getting this error: http.client.IncompleteRead: IncompleteRead(2268 bytes read, 612 more expected)

Here is the link I am trying to crawl: www.rc2.vd.ch

Here is the Python code I am using:

import requests
from bs4 import BeautifulSoup
def spider_list():
    url = 'http://www.rc2.vd.ch/registres/hrcintapp-pub/companySearch.action?lang=FR&init=false&advancedMode=false&printMode=false&ofpCriteria=N&actualDate=18.08.2015&rowMin=0&rowMax=0&listSize=0&go=none&showHeader=false&companyName=&companyNameSearchType=CONTAIN&companyOfsUid=&companyOfrcId13Part1=&companyOfrcId13Part2=&companyOfrcId13Part3=&limitResultCompanyActive=ACTIVE&searchRows=51&resultFormat=STD_COMP_NAME&display=Rechercher#result'

    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')

    for link in soup.findAll('a', {'class': 'hoverable'}):
        print(link)

spider_list()

I tried a link from another website and it works fine, so why can't I crawl this one?

If it's not possible with this code, how can I do it?

------------ EDIT ------------

Here is the full error message:

Traceback (most recent call last):
  File "C:/Users/Nuriddin/PycharmProjects/project/a.py", line 19, in <module>
    spider_list()
  File "C:/Users/Nuriddin/PycharmProjects/project/a.py", line 12, in spider_list
    source_code = requests.get(url)
  File "C:\Python34\lib\site-packages\requests\api.py", line 69, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Python34\lib\site-packages\requests\api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "C:\Python34\lib\site-packages\requests\sessions.py", line 465, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Python34\lib\site-packages\requests\sessions.py", line 605, in send
    r.content
  File "C:\Python34\lib\site-packages\requests\models.py", line 750, in content
    self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()
  File "C:\Python34\lib\site-packages\requests\models.py", line 673, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "C:\Python34\lib\site-packages\requests\packages\urllib3\response.py", line 303, in stream
    for line in self.read_chunked(amt, decode_content=decode_content):
  File "C:\Python34\lib\site-packages\requests\packages\urllib3\response.py", line 450, in read_chunked
    chunk = self._handle_chunk(amt)
  File "C:\Python34\lib\site-packages\requests\packages\urllib3\response.py", line 420, in _handle_chunk
    returned_chunk = self._fp._safe_read(self.chunk_left)
  File "C:\Python34\lib\http\client.py", line 664, in _safe_read
    raise IncompleteRead(b''.join(s), amt)
http.client.IncompleteRead: IncompleteRead(4485 bytes read, 628 more expected)

Could you provide the full stack trace? How do you run your program? — Vincent Beltman, Aug 18 '15 at 14:10
@VincentBeltman I am running the program through Pycharm I'll post the full error on my post hang on. — Horai Nuri, Aug 18 '15 at 14:11
http://stackoverflow.com/questions/14442222/how-to-handle-incompleteread-in-python You may use try and except, as explained in this answer. — Vineet Kumar Doshi, Aug 18 '15 at 14:14
Anyway the answer to the second part of the Q is that the webserver is b0rken, responding with Content-Length header containing more bytes than in the actual response — Antti Haapala -- Слава Україні, Aug 18 '15 at 14:15
@Sia I installed Python3 in idle ... and It is working fine! http://i.imgur.com/OdFDO2e.png see the image for reference. Your code is correct!! — Vineet Kumar Doshi, Aug 18 '15 at 14:26
@VineetKumarDoshi Thanks, I know that my code is correct the thing I don't understand is why it's working on your computer but not on mine... — Horai Nuri, Aug 18 '15 at 14:35

Vineet Kumar Doshi · Accepted Answer · 2015-08-18T14:53:13.760

2

There might be a problem with your editor.

I get correct results when I run your code in IDLE with Python 3.

(A screenshot of the output was attached to the original answer for reference.)

The only thing I can think of is to catch the error so that it does not crash the program:

import http.client

import requests
from bs4 import BeautifulSoup


def spider_list():
    url = 'http://www.rc2.vd.ch/registres/hrcintapp-pub/companySearch.action?lang=FR&init=false&advancedMode=false&printMode=false&ofpCriteria=N&actualDate=18.08.2015&rowMin=0&rowMax=0&listSize=0&go=none&showHeader=false&companyName=&companyNameSearchType=CONTAIN&companyOfsUid=&companyOfrcId13Part1=&companyOfrcId13Part2=&companyOfrcId13Part3=&limitResultCompanyActive=ACTIVE&searchRows=51&resultFormat=STD_COMP_NAME&display=Rechercher#result'
    try:
        source_code = requests.get(url)
    except (requests.exceptions.RequestException, http.client.IncompleteRead) as err:
        # Catch only request/read failures instead of using a bare except.
        # I just report the error here; do whatever you want in case of error.
        print('Request failed:', err)
        return

    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')

    # Print every <a> tag that has the "hoverable" class
    for link in soup.find_all('a', {'class': 'hoverable'}):
        print(link)


spider_list()

Let me know if it helps.
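
Catching the error keeps the script alive, but it also throws away the bytes that did arrive. As a rough sketch of an alternative (not part of the original answer), the standard library's `http.client.IncompleteRead` exception carries the data received so far in its `partial` attribute, so a truncated page can still be parsed. The helper names `read_tolerant` and `fetch_html` are hypothetical, and decoding as UTF-8 is an assumption about the page's encoding.

```python
import http.client
import urllib.request


def read_tolerant(response):
    """Return (body_bytes, complete) for any object with a read() method.

    If the server closes the connection early, keep the bytes that
    were received instead of discarding them.
    """
    try:
        return response.read(), True
    except http.client.IncompleteRead as exc:
        # exc.partial holds the bytes read before the stream ended.
        return exc.partial, False


def fetch_html(url, timeout=30):
    """Hypothetical helper: fetch a page with urllib instead of requests."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        body, complete = read_tolerant(response)
    # Assumption: the page is UTF-8; undecodable bytes are replaced.
    return body.decode("utf-8", errors="replace"), complete
```

The returned text can be passed to `BeautifulSoup(text, 'html.parser')` as before. When `complete` is `False` the HTML may be cut off mid-tag, so some links near the end of the page could be missing.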

edited Aug 18 '15 at 14:53

answered Aug 18 '15 at 13:50

Vineet Kumar Doshi


It didn't work with IDLE too, I've got the same error : `http.client.IncompleteRead: IncompleteRead(907 bytes read, 1867 more expected)`. – Horai Nuri Aug 18 '15 at 14:03
What happens if you try another website? – Vincent Beltman Aug 18 '15 at 14:42
@VineetKumarDoshi No it doesn't help :/ I get the same error, I'll try with another computer and if it works I'll compare everything on the two computers – Horai Nuri Aug 18 '15 at 14:47
I re-edited the code, try this one.. If this doesn't work then let me know. – Vineet Kumar Doshi Aug 18 '15 at 15:02
yes it works in case of an error now, however my function should scrap the url datas, if I get nothing there is no point for my project :/ now I tried on another computer everything is working fine I guess it's because this computer is 11 years old and everything on it is modified. But in case of error your code would be useful that's why I'll confirm this answer :) – Horai Nuri Aug 18 '15 at 15:16

SIM · Answer 2 · 2020-02-06T18:37:48.063

How about this one!!

import requests
from lxml.html import fromstring

url = 'https://www.rc2.vd.ch/registres/hrcintapp-pub/companySearch.action?lang=FR&init=false&advancedMode=false&printMode=false&ofpCriteria=N&actualDate=18.08.2015&rowMin=0&rowMax=0&listSize=0&go=none&showHeader=false&companyName=&companyNameSearchType=CONTAIN&companyOfsUid=&companyOfrcId13Part1=&companyOfrcId13Part2=&companyOfrcId13Part3=&limitResultCompanyActive=ACTIVE&searchRows=51&resultFormat=STD_COMP_NAME&display=Rechercher#result'

def spider_list(link):
    code = requests.get(link)
    tree = fromstring(code.text)
    skim = tree.xpath('//a[@class="hoverable"]/@href')
    print(skim)

if __name__ == '__main__':
    spider_list(url)
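
Since the comments suggest the failure is intermittent (the same code works on some machines and not others), one option is to retry the request a few times before giving up. The following is only a sketch under that assumption: `get_with_retries` is a hypothetical helper, and it takes the fetching function as an argument so it works with `requests.get` or any other callable.

```python
import http.client
import time


def get_with_retries(fetch, url, attempts=3, delay=1.0,
                     retry_on=(http.client.IncompleteRead,)):
    """Call fetch(url) until it succeeds or the attempts run out.

    retry_on is the tuple of exception types treated as transient;
    the last such exception is re-raised if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except retry_on as exc:
            last_error = exc
            # Wait before the next try, but not after the final one.
            if attempt < attempts:
                time.sleep(delay)
    raise last_error
```

With requests this could be called as `get_with_retries(requests.get, url, retry_on=(http.client.IncompleteRead, requests.exceptions.ChunkedEncodingError))`. Recent versions of requests generally report a truncated chunked body as `ChunkedEncodingError`, while the traceback in the question shows the older behaviour of letting `IncompleteRead` through, so listing both is the cautious choice.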
