
So I am following this tutorial on web scraping with Python. Any time I run the code I get this error:

FileNotFoundError: [Errno 2] No such file or directory: './data/nyct/turnstile/turnstile_200314.txt'

I have a hunch it means the web scraper cannot access the file, but when I inspect the HTML the file is present. Please help. Here is my code for reference:

import requests
import urllib.request
import time
from bs4 import BeautifulSoup

#Set URL you want to webscrape from
url = 'http://web.mta.info/developers/turnstile.html'

#Connect to URL
response = requests.get(url)

#Parse HTML and save to BeautifulSoup object
soup = BeautifulSoup(response.text,'html.parser')

#Loop to download whole dataset
linecount = 1 #var to track current line

for onetag in soup.findAll('a'):
    if linecount>=36:
        link = onetag['href']
        downloadurl = 'http://web.mta.info/developers/'+link
        urllib.request.urlretrieve(downloadurl,'./'+link[link.find('/turnsttile_')+1:])
        time.sleep(3)#pause code so as to not get flagged as spammer

    #increment for next line
    linecount+=1
Selcuk
kwesiopon
    The error is not about reading from the web site; it is having difficulty writing the `.txt` file on your system. Do you have the `data/nyct/turnstile` directory in the same directory as your script? – Selcuk Mar 19 '20 at 03:24
  • I haven't made a directory for that I was assuming the file I was extracting would be sent there. Sorry, I'm pretty new to this so could you please elaborate. – kwesiopon Mar 19 '20 at 03:30
  • Sorry, what do you mean by "would be sent there"? Your code downloads from a URL then saves it to a local file. You must have the directory already created if you want to save a file under that directory. – Selcuk Mar 19 '20 at 03:34
  • So basically the code cannot save the file due to there not being a ```data/nyc/turnstile``` file directory? – kwesiopon Mar 19 '20 at 04:00
  • Please provide the entire error message. As an aside, you should pass the result of `response.content` to `BeautifulSoup()`, not `response.text`. Also, why use both requests and urllib.request? – AMC Mar 19 '20 at 04:19
  • Does this answer your question? [FileNotFoundError: \[Errno 2\] No such file or directory](https://stackoverflow.com/questions/22282760/filenotfounderror-errno-2-no-such-file-or-directory) – AMC Mar 19 '20 at 04:20
  • @kwesiopon Yes. – Selcuk Mar 19 '20 at 05:42
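To make the point in the comments concrete: the error happens while *writing*, because `urllib.request.urlretrieve()` will not create missing parent directories. A minimal sketch of the fix, using the example href from the question:

```python
import os

link = 'data/nyct/turnstile/turnstile_200314.txt'  # example href from the question

# Create the local directory tree before saving; exist_ok=True makes reruns safe.
os.makedirs(os.path.dirname(link), exist_ok=True)

# Now a call like urllib.request.urlretrieve(downloadurl, './' + link) can write the file.
print(os.path.isdir('data/nyct/turnstile'))  # True
```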

1 Answer


Put the following script in a folder and run it. Make sure to adjust the `[:2]` slice to suit your needs; I've limited it to two files as a test:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'http://web.mta.info/developers/turnstile.html'
base = 'http://web.mta.info/developers/'

response = requests.get(url)
soup = BeautifulSoup(response.text,'html.parser')
for tag in soup.select('a[href^="data/nyct/"]')[:2]:
    filename = tag['href'].split("_")[1]
    with open(filename,"wb") as f:
        f.write(requests.get(urljoin(base,tag['href'])).content)
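For reference, the `split("_")[1]` above flattens each href into a short filename saved in the current folder, which is why no extra directory needs to exist first:

```python
href = 'data/nyct/turnstile/turnstile_200314.txt'  # example href from the page
print(href.split("_")[1])  # 200314.txt
```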

If you wanna stick to .find_all(), this is something you can do to achieve the same:

for onetag in soup.find_all('a',href=True):
    if not onetag['href'].startswith('data/nyct/'):continue
    link = urljoin(base,onetag['href'])
    print(link)

Or like this:

for onetag in soup.find_all('a',href=lambda e: e and e.startswith("data/nyct/")):
    link = urljoin(base,onetag['href'])
    print(link)
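In all the variants above, `urljoin()` resolves the relative href against the base URL, for example:

```python
from urllib.parse import urljoin

base = 'http://web.mta.info/developers/'
print(urljoin(base, 'data/nyct/turnstile/turnstile_200314.txt'))
# http://web.mta.info/developers/data/nyct/turnstile/turnstile_200314.txt
```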
SIM
  • Thank you! This works would you mind explaining the changes you made and how they made the code work? Again thanks for your help! – kwesiopon Mar 19 '20 at 17:24
  • Check out [this link](http://www.compciv.org/guides/python/fileio/open-and-write-files/) to get the clarity. – SIM Mar 19 '20 at 17:35