
So I am following this tutorial on web scraping with Python. Any time I run the code I get this error:

FileNotFoundError: [Errno 2] No such file or directory: './data/nyct/turnstile/turnstile_200314.txt'

I have a hunch it means the web scraper cannot access the file, but when I inspect the HTML the file is present. Please help. Here is my code for reference:

import requests
import urllib.request
import time
from bs4 import BeautifulSoup

#Set URL you want to webscrape from
url = 'http://web.mta.info/developers/turnstile.html'

#Connect to URL
response = requests.get(url)

#Parse HTML and save to BeautifulSoup object
soup = BeautifulSoup(response.text,'html.parser')

#Loop to download whole dataset
linecount = 1 #var to track current line

for onetag in soup.findAll('a'):
    if linecount>=36:
        link = onetag['href']
        downloadurl = 'http://web.mta.info/developers/'+link
        urllib.request.urlretrieve(downloadurl,'./'+link[link.find('/turnsttile_')+1:])
        time.sleep(3)#pause code so as to not get flagged as spammer

    #increment for next line
    linecount+=1
Selcuk
kwesiopon
    The error is not about reading from the web site; it is having difficulty writing the `.txt` file on your system. Do you have the `data/nyct/turnstile` directory in the same directory as your script? – Selcuk Mar 19 '20 at 03:24
  • I haven't made a directory for that I was assuming the file I was extracting would be sent there. Sorry, I'm pretty new to this so could you please elaborate. – kwesiopon Mar 19 '20 at 03:30
  • Sorry, what do you mean by "would be sent there"? Your code downloads from a URL then saves it to a local file. You must have the directory already created if you want to save a file under that directory. – Selcuk Mar 19 '20 at 03:34
  • So basically the code cannot save the file due to there not being a ```data/nyc/turnstile``` file directory? – kwesiopon Mar 19 '20 at 04:00
  • Please provide the entire error message. As an aside, you should pass the result of `response.content` to `BeautifulSoup()`, not `response.text`. Also, why use both requests and urllib.request? – AMC Mar 19 '20 at 04:19
  • Does this answer your question? [FileNotFoundError: \[Errno 2\] No such file or directory](https://stackoverflow.com/questions/22282760/filenotfounderror-errno-2-no-such-file-or-directory) – AMC Mar 19 '20 at 04:20
  • @kwesiopon Yes. – Selcuk Mar 19 '20 at 05:42
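To make the point in the comments concrete: the error happens while *writing*, because `urllib.request.urlretrieve()` will not create missing parent directories. A minimal sketch of the fix, using the example href from the question:

```python
import os

link = 'data/nyct/turnstile/turnstile_200314.txt'  # example href from the question

# Create the local directory tree before saving; exist_ok=True makes reruns safe.
os.makedirs(os.path.dirname(link), exist_ok=True)

# Now a call like urllib.request.urlretrieve(downloadurl, './' + link) can write the file.
print(os.path.isdir('data/nyct/turnstile'))  # True
```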

1 Answer


Put the following script in a folder and run it. Make sure to adjust the `[:2]` slice to suit your needs; I've limited it to two files as a test:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'http://web.mta.info/developers/turnstile.html'
base = 'http://web.mta.info/developers/'

response = requests.get(url)
soup = BeautifulSoup(response.text,'html.parser')
for tag in soup.select('a[href^="data/nyct/"]')[:2]:
    filename = tag['href'].split("_")[1]
    with open(filename,"wb") as f:
        f.write(requests.get(urljoin(base,tag['href'])).content)
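For reference, the `split("_")[1]` above flattens each href into a short filename saved in the current folder, which is why no extra directory needs to exist first:

```python
href = 'data/nyct/turnstile/turnstile_200314.txt'  # example href from the page
print(href.split("_")[1])  # 200314.txt
```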

If you wanna stick to .find_all(), this is something you can do to achieve the same:

for onetag in soup.find_all('a',href=True):
    if not onetag['href'].startswith('data/nyct/'):continue
    link = urljoin(base,onetag['href'])
    print(link)

Or like this:

for onetag in soup.find_all('a',href=lambda e: e and e.startswith("data/nyct/")):
    link = urljoin(base,onetag['href'])
    print(link)
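In all the variants above, `urljoin()` resolves the relative href against the base URL, for example:

```python
from urllib.parse import urljoin

base = 'http://web.mta.info/developers/'
print(urljoin(base, 'data/nyct/turnstile/turnstile_200314.txt'))
# http://web.mta.info/developers/data/nyct/turnstile/turnstile_200314.txt
```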
SIM
  • Thank you! This works would you mind explaining the changes you made and how they made the code work? Again thanks for your help! – kwesiopon Mar 19 '20 at 17:24
  • Check out [this link](http://www.compciv.org/guides/python/fileio/open-and-write-files/) to get the clarity. – SIM Mar 19 '20 at 17:35