I want to pull data from URLs whose dates, encoded in the URL itself, fall within a specific range, and save that data to CSVs for local use.
http://web.mta.info/developers/data/nyct/turnstile/turnstile_190629.txt
The six digits at the end of the URL encode the date as year-month-day: 190629 is 2019-06-29.
I am collecting the data for March through June (03-06) of 2016 through 2019 (16-19). For each URL that exists, I save an individual csv, and I also combine everything into a single csv to feed into a pandas DataFrame.
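(For clarity, "exists" here just means the server answers the URL with HTTP 200. I believe a HEAD request can test that without pulling down the whole file; this sketch assumes the MTA server answers HEAD the same way it answers GET, and url_exists is just a name I made up.)

import requests

def url_exists(url):
    # HEAD fetches only the status line and headers, not the file body
    response = requests.head(url)
    return response.status_code == 200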
The code below works, but it's super slow, and I know it's not the most Pythonic way of doing this.
import requests
import pandas as pd
import itertools
date_list = [['16', '17', '18', '19'],
             ['03', '04', '05', '06'],
             ['01', '02', '03', '04', '05', '06',
              '07', '08', '09', '10', '11', '12',
              '13', '14', '15', '16', '17', '18',
              '19', '20', '21', '22', '23', '24',
              '25', '26', '27', '28', '29', '30', '31']]
date_combo = []
# - create year - month - day combos
# - link: https://stackoverflow.com/questions/798854/all-combinations-of-a-list-of-lists
for sub_list in itertools.product(*date_list):
    date_combo.append(sub_list)
url_lead = 'http://web.mta.info/developers/data/nyct/turnstile/turnstile_'
url_list = []
# - this checks each url is valid and adds it to a list
for year, month, day in date_combo:
    concat_url = url_lead + year + month + day + '.txt'
    response = requests.get(concat_url)
    if response.status_code == 200:
        # ---- collect the active urls
        url_list.append(concat_url)
        # ---- create an individual csv ---- change path for saving locally
        # ---- filename is the date
        df = pd.read_csv(concat_url, header=0, sep=',')
        df.to_csv(r'/Users/.../GitHub/' + year + month + day + '.csv')
# - this creates a master df from all the urls
dfs = [pd.read_csv(url, header=0, sep=',') for url in url_list]
df = pd.concat(dfs, ignore_index=True)
df.to_csv(r'/Users/.../GitHub/seasonal_mta_data_01.csv')
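For what it's worth, here's the direction I was considering to speed things up: right now each valid file is downloaded three times (once for the status check, once for its individual csv, and once more for the master csv). A rough, untested sketch that fetches each file once and reuses the response body via io.StringIO (standard library):

import io
import itertools
import pandas as pd
import requests

url_lead = 'http://web.mta.info/developers/data/nyct/turnstile/turnstile_'
years = ['16', '17', '18', '19']
months = ['03', '04', '05', '06']
days = [str(d).zfill(2) for d in range(1, 32)]

dfs = []
for year, month, day in itertools.product(years, months, days):
    url = url_lead + year + month + day + '.txt'
    response = requests.get(url)
    if response.status_code == 200:
        # parse the body we already downloaded instead of re-fetching the url
        df = pd.read_csv(io.StringIO(response.text))
        df.to_csv(r'/Users/.../GitHub/' + year + month + day + '.csv')
        dfs.append(df)

pd.concat(dfs, ignore_index=True).to_csv(r'/Users/.../GitHub/seasonal_mta_data_01.csv')

That should cut three downloads per file down to one, but I haven't profiled it, so I don't know how much of the slowness it explains.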
My code is running as expected, but I'd appreciate any recommendations to clean it up!