I'm trying to scrape data from https://ocfs.ny.gov/main/childcare/ccfs_template.asp without a limit on the number of records per page. The developer tools show a POST request to "https://apps.netforge.ny.gov/dcfs/Search/Search HTP/1.1" when I click Search (a space has to be entered in the Name field before Search will execute).

I want to download all the data into a file. My code uses requests.post, but I'm not sure I'm using it correctly. The error I get is shown below my code. I'd appreciate some guidance on how to modify it. I'm fairly new to Python.

Code as follows:

import requests, csv

dataArg={'Criteria.ModalityCode':'', 'Criteria.CountyID':'', 'Criteria.SchoolDistrict':'', 'Criteria.ZipCode':'', 'Criteria.FacilityName':'+', 'Criteria.RegistrationID':'', 'Criteria.MedicationOnly':'false', 'Criteria.NonTraditionalHoursOnly':'false', 'Criteria.ShowOpenOnly':'true', 'Criteria.ShowOpenOnly':'false', 'Paging.PageSize':''}
dataCsv = requests.post('https://apps.netforge.ny.gov/dcfs/Search/Search HTP/1.1',data=dataArg)

openFile = open('nydata', 'wb')
for chunk in dataCsv.iter_content(1000000):
    openFile.write(chunk)

open_csv = open('nydata')
csv_reader = csv.reader(open_csv)
list_data = list(csv_reader)

Error:

File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 346, in _make_request
    self._validate_conn(conn)
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 850, in _validate_conn
    conn.connect()
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connection.py", line 326, in connect
    ssl_context=context)
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\util\ssl_.py", line 329, in ssl_wrap_socket
    return context.wrap_socket(sock, server_hostname=server_hostname)
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\ssl.py", line 407, in wrap_socket
    _context=self, _session=session)
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\ssl.py", line 814, in __init__
    self.do_handshake()
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\ssl.py", line 1068, in do_handshake
    self._sslobj.do_handshake()
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\ssl.py", line 689, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:777)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\adapters.py", line 440, in send
    timeout=timeout
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 639, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\util\retry.py", line 388, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='apps.netforge.ny.gov', port=443): Max retries exceeded with url: /dcfs/Search/Search%20HTTP/1.1?Criteria.ModalityCode=&Criteria.CountyID=&Criteria.SchoolDistrict=&Criteria.ZipCode=&Criteria.FacilityName=+&Criteria.RegistrationID=&Criteria.MedicationOnly=false&Criteria.NonTraditionalHoursOnly=false&Criteria.ShowOpenOnly=true&Criteria.ShowOpenOnly=false&Paging.PageSize= (Caused by SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:777)'),))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\NY.py", line 3, in <module>
    dataCsv = requests.get('https://apps.netforge.ny.gov/dcfs/Search/Search HTTP/1.1?Criteria.ModalityCode=&Criteria.CountyID=&Criteria.SchoolDistrict=&Criteria.ZipCode=&Criteria.FacilityName=+&Criteria.RegistrationID=&Criteria.MedicationOnly=false&Criteria.NonTraditionalHoursOnly=false&Criteria.ShowOpenOnly=true&Criteria.ShowOpenOnly=false&Paging.PageSize=')
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\sessions.py", line 618, in send
    r = adapter.send(request, **kwargs)
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\adapters.py", line 506, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='apps.netforge.ny.gov', port=443): Max retries exceeded with url: /dcfs/Search/Search%20HTTP/1.1?Criteria.ModalityCode=&Criteria.CountyID=&Criteria.SchoolDistrict=&Criteria.ZipCode=&Criteria.FacilityName=+&Criteria.RegistrationID=&Criteria.MedicationOnly=false&Criteria.NonTraditionalHoursOnly=false&Criteria.ShowOpenOnly=true&Criteria.ShowOpenOnly=false&Paging.PageSize= (Caused by SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:777)'),))
>>> 
== RESTART: C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\NY.py ==
Traceback (most recent call last):
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 346, in _make_request
    self._validate_conn(conn)
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 850, in _validate_conn
    conn.connect()
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connection.py", line 326, in connect
    ssl_context=context)
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\util\ssl_.py", line 329, in ssl_wrap_socket
    return context.wrap_socket(sock, server_hostname=server_hostname)
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\ssl.py", line 407, in wrap_socket
    _context=self, _session=session)
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\ssl.py", line 814, in __init__
    self.do_handshake()
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\ssl.py", line 1068, in do_handshake
    self._sslobj.do_handshake()
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\ssl.py", line 689, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:777)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\adapters.py", line 440, in send
    timeout=timeout
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 639, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\util\retry.py", line 388, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='apps.netforge.ny.gov', port=443): Max retries exceeded with url: /dcfs/Search/Search%20HTP/1.1 (Caused by SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:777)'),))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\NY.py", line 19, in <module>
    dataCsv = requests.post('https://apps.netforge.ny.gov/dcfs/Search/Search HTP/1.1',data=dataArg)
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\api.py", line 112, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\sessions.py", line 618, in send
    r = adapter.send(request, **kwargs)
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\adapters.py", line 506, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='apps.netforge.ny.gov', port=443): Max retries exceeded with url: /dcfs/Search/Search%20HTP/1.1 (Caused by SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:777)'),))
Keyur Potdar
coder101
  • Is there something wrong with the way this question is posted? Curious why nobody has responded to it, considering there are quite a few python and web scraping experts on this site. Appreciate some guidance on the question above. – coder101 Mar 06 '18 at 16:09
  • Nobody has answered because your code is not reproducible. The link `https://apps.netforge.ny.gov/dcfs/Search/Search HTP/1.1` returns a `Runtime Error`. Either the link is incorrect or the site requires login access. Nothing in your code relates to `web scraping` beyond opening and reading a file. – hcheung Mar 07 '18 at 05:59
  • Adding `verify=False` to the `.post()` method seems to work. But, I'm no expert on this, so can't explain why. – Keyur Potdar Mar 08 '18 at 06:01
  • @KeyurPotdar: Now it's reduced the error to a warning, but the resulting file has no data. Here is the revised command: "dataCsv = requests.post('https://apps.netforge.ny.gov/dcfs/Search/Search HTP/1.1',data=dataArg, verify=False)". dataArg is defined in the code presented in the question on top. Something else I need to do? – coder101 Mar 08 '18 at 06:44

1 Answer

First of all, the POST request goes to https://apps.netforge.ny.gov/dcfs/Search/Search, not to https://apps.netforge.ny.gov/dcfs/Search/Search HTP/1.1.
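The traceback in the question shows the effect of the extra text: the requested path is `/dcfs/Search/Search%20HTP/1.1`. The following is a minimal sketch, using only the standard library, that approximates how the pasted text ends up percent-encoded into the path. The helper name `encoded_path` is hypothetical, and Requests does its own URL preparation internally, so treat this as an illustration rather than the library's exact code path.

```python
from urllib.parse import quote, urlsplit

# URL as pasted in the question, and the endpoint without the trailing text.
pasted_url = "https://apps.netforge.ny.gov/dcfs/Search/Search HTP/1.1"
endpoint_url = "https://apps.netforge.ny.gov/dcfs/Search/Search"


def encoded_path(url):
    """Return the percent-encoded path component of a URL (hypothetical helper)."""
    return quote(urlsplit(url).path)


print(encoded_path(pasted_url))    # /dcfs/Search/Search%20HTP/1.1
print(encoded_path(endpoint_url))  # /dcfs/Search/Search
```

The space is encoded as `%20`, so the server is asked for a resource that does not exist at that path.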

Regarding SSL Cert Verification, the docs say:

Requests verifies SSL certificates for HTTPS requests, just like a web browser. By default, SSL verification is enabled, and Requests will throw a SSLError if it's unable to verify the certificate.

So, you can set verify=False to get past this error. Note, however, that you shouldn't use it in production code, because it disables certificate verification altogether.
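As an alternative to switching verification off, the `verify` setting in Requests also accepts a path to a CA bundle file. The sketch below assumes the failure is caused by a CA certificate missing from the default bundle, which the question does not confirm. The function name `build_session` and any bundle path you pass to it are hypothetical.

```python
import requests


def build_session(ca_bundle=None):
    """Return a Session that keeps certificate verification enabled.

    ca_bundle: optional path to a PEM file with the CA certificates to
    trust (hypothetical; supply your own). With None, Requests keeps its
    default bundle and verification stays on.
    """
    session = requests.Session()
    if ca_bundle is not None:
        session.verify = ca_bundle
    return session


session = build_session()
print(session.verify)  # True
```

Requests made through a session built with a bundle path, for example `session.post(url, data=data)`, would then be checked against that bundle instead of skipping the check.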

Finally, using this code will give you the page:

import requests

data = {
    'Criteria.ModalityCode': '',
    'Criteria.CountyID': '',
    'Criteria.SchoolDistrict': '',
    'Criteria.ZipCode': '',
    'Criteria.FacilityName': '+',
    'Criteria.RegistrationID': '',
    'Criteria.MedicationOnly': 'false',
    'Criteria.NonTraditionalHoursOnly': 'false',
    'Criteria.ShowOpenOnly': 'false',
    'Paging.PageSize': ''
}

dataCsv = requests.post('https://apps.netforge.ny.gov/dcfs/Search/Search', data=data, verify=False)
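To check what is actually sent, the request can be prepared without sending it. This is a sketch under two assumptions: the page size of 500 is an arbitrary illustrative value, and the Name field is meant to hold the single space described in the question. A literal `'+'` in the dictionary is form-encoded as `%2B`, whereas a space is encoded as `+`. The names `SEARCH_URL` and `build_payload` are hypothetical.

```python
import requests

SEARCH_URL = "https://apps.netforge.ny.gov/dcfs/Search/Search"


def build_payload(page_size):
    """Form fields from the answer, with Paging.PageSize filled in."""
    return {
        "Criteria.ModalityCode": "",
        "Criteria.CountyID": "",
        "Criteria.SchoolDistrict": "",
        "Criteria.ZipCode": "",
        "Criteria.FacilityName": " ",  # assumption: a single space, encoded as '+'
        "Criteria.RegistrationID": "",
        "Criteria.MedicationOnly": "false",
        "Criteria.NonTraditionalHoursOnly": "false",
        "Criteria.ShowOpenOnly": "false",
        "Paging.PageSize": str(page_size),  # assumption: illustrative value
    }


# Prepare (but do not send) the request so the encoded body can be inspected.
prepared = requests.Request("POST", SEARCH_URL, data=build_payload(500)).prepare()
print(prepared.body)
```

The same dictionary can be passed as `data=build_payload(500)` to `requests.post` once the encoded body looks right.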
Keyur Potdar
  • Thanks for your help. Works now. I had to specify a value for Paging.PageSize - else wouldn't return data. Two follow-ups: 1. Why should you not use "verify=false" in a production code? What are the consequences of that? 2. Why in developer tools the Post method shows " HTTP/1.1" as part of the URL(https://apps.netforge.ny.gov/dcfs/Search/Search) after "Search/Search"?. What is the significance of " HTTP/1.1" and how would someone know to ignore or include it? – coder101 Mar 08 '18 at 12:47
  • Actually, it didn't show that to me. It shows the URL I've used in the code. Are you looking at the file called `Search`? – Keyur Potdar Mar 08 '18 at 12:48
  • Never mind. I must have picked that " HTTP/1.1" up from somewhere in the source code when the plain URL didn't seem to do the trick for me. Finally, why should we not use "verify=False" in a production code (as you cautioned above)? I mean, what are the consequences of that? – coder101 Mar 08 '18 at 12:56
  • Have a look at this answer - https://stackoverflow.com/a/12864892/7832176 – Keyur Potdar Mar 08 '18 at 14:11