
I am trying to accomplish a rather simple task...

I am looking to loop through all .csv files in a specified GitHub repository, specifically this one

The following minimal, complete, reproducible example should demonstrate the problem:

import pandas as pd, urllib, requests, os, glob
base_url = 'https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series'
# https://stackoverflow.com/questions/39065921/what-do-raw-githubusercontent-com-urls-represent
base_raw_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series'

#base_dir = os.listdir(base_url)
#base_raw_dir = os.listdir(base_raw_url)

# https://stackoverflow.com/questions/61036695/import-multiple-csv-files-from-github-folder-python-covid-19
csv_files = glob.glob(base_raw_url+'/*.csv')
print(csv_files)

[]

csv_files is an empty list, and both os.listdir() attempts result in:

OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series'

How can I simply loop through the directory? Ultimately, I am looking to get the complete path (URL) for each of the .csv files.

artemis

1 Answer


You can't access files like that with a web address. `os.listdir()` only works on your local machine's file system, and `glob.glob()` likewise only expands local paths, which is why it silently returns an empty list. What you are trying to do is called web scraping, and `bs4` (Beautiful Soup) is well suited to the task: you will need to parse the page's HTML and extract the relevant link to each file.

A handy tutorial on BS4: https://realpython.com/beautiful-soup-web-scraper-python/
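
Here is a minimal sketch of that approach. It assumes the GitHub directory page renders each file as an <a> tag whose href contains '/blob/' and ends in '.csv'; GitHub's markup changes over time, so the selector may need adjusting:

import requests
from bs4 import BeautifulSoup

base_url = 'https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series'

html = requests.get(base_url).text
soup = BeautifulSoup(html, 'html.parser')

csv_urls = []
for a in soup.find_all('a', href=True):
    href = a['href']
    if href.endswith('.csv') and '/blob/' in href:
        # Turn the web-view path (/user/repo/blob/branch/...) into the
        # raw-content URL. Note there is no '/tree/' segment in raw URLs,
        # which is one reason the base_raw_url in the question fails.
        raw = 'https://raw.githubusercontent.com' + href.replace('/blob/', '/', 1)
        if raw not in csv_urls:
            csv_urls.append(raw)

print(csv_urls)

Each of the resulting raw URLs can then be passed straight to pd.read_csv() to load the data.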

The `data-pjax="#repo-content-pjax-container"` attribute should be sufficient to get the files when scraping. – Wamadahama Apr 14 '21 at 02:01
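
For what it's worth, a short sketch of that suggestion, assuming GitHub's markup at the time (where each file link in a directory listing carried that attribute); it scopes the search to file links instead of filtering every anchor on the page:

import requests
from bs4 import BeautifulSoup

base_url = 'https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series'
soup = BeautifulSoup(requests.get(base_url).text, 'html.parser')

# Only the anchors GitHub tags as file-listing links (markup may have changed since).
file_links = soup.find_all('a', attrs={'data-pjax': '#repo-content-pjax-container'})
csv_hrefs = [a['href'] for a in file_links if a['href'].endswith('.csv')]
print(csv_hrefs)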