
I'm trying to do two things here:

  1. Import all the .csv files and combine them into a single df.
  2. Update the df with the latest file uploaded.

I have been able to import one .csv with:

import pandas as pd
url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv' 
pd.read_csv(url).fillna(0)

I could import all the .csv files one by one (or with a loop if I knew how to extract all the .csv filenames), but there should be a more efficient way. Once I have the df, to "update" it I would:

  1. Extract all the .csv filenames.
  2. Check if all of them are in the df (with the date column). If one is missing, add the missing .csv file to the df.

The problems I'm having are: (a) how can I make the way I extract all the .csv files scalable? and (b) is there any way to extract ONLY the filenames that end with .csv from the GitHub folder, so that I can do step (2) above?
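For reference, a minimal sketch of one way to list only the `.csv` filenames in that folder on the web, using the GitHub contents API (this assumes the `requests` library and the repo path taken from the raw URL above; the contents API returns at most 1,000 entries per directory, so a very large folder may need the git trees API instead):

import requests

# GitHub contents API endpoint for the daily-reports folder
api_url = ('https://api.github.com/repos/CSSEGISandData/COVID-19/'
           'contents/csse_covid_19_data/csse_covid_19_daily_reports')

entries = requests.get(api_url).json()

# keep only the entries whose name ends with .csv (filenames look like 'MM-DD-YYYY.csv')
csv_names = [entry['name'] for entry in entries if entry['name'].endswith('.csv')]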

Chris
  • Does this answer your question? [Import multiple csv files into pandas and concatenate into one DataFrame](https://stackoverflow.com/questions/20906474/import-multiple-csv-files-into-pandas-and-concatenate-into-one-dataframe) – MShakeG Apr 05 '20 at 12:46
  • Nope, because the files are on the web. The main problem is extracting the filenames ending with `.csv` from the folder on the web. All the solutions below assume the folder is local, when it's on the web. – Chris Apr 05 '20 at 18:30
  • I too am trying to solve this mystery – Geonerd Dec 22 '20 at 17:42

3 Answers


You can list all the csv files like this:

import glob

csvfiles = glob.glob("/path/to/folder/*.csv")

Once you have all the csv file paths, you can loop over them, read each one into a df, and check whether a column is missing, or whatever else you need.
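A minimal sketch of that loop, assuming pandas is imported as `pd` and the files sit under a local folder:

import glob
import pandas as pd

csvfiles = glob.glob("/path/to/folder/*.csv")

dfs = []
for path in csvfiles:
    # read each file and fill missing values, as in the question
    dfs.append(pd.read_csv(path).fillna(0))

# stack everything into one DataFrame
combined = pd.concat(dfs, ignore_index=True)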

Binh

I am still trying to find a better solution, but below is a workaround that I use in my code to pull from a GitHub directory. Unfortunately, I still have not found a way to just get a list of the CSVs in the GitHub directory the way you can when it is on a local drive.

import pandas as pd

def read_multi_csv(start_year, end_year):
    years = list(range(start_year, end_year + 1))
    dfs = []
    for YYYY in years:
        # build the raw URL for each year's file
        file = 'https://raw.githubusercontent.com/username/project/main/data/normalized/' + str(YYYY) + '_crimes_byState.csv'
        #print (file)
        df = pd.read_csv(file)
        dfs.append(df)
    all_dfs = pd.concat(dfs)
    return all_dfs

read_multi_csv(2013,2019)
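Building on the same pattern, a rough sketch of the "update" step from the question: given a list of filenames (for example from a directory listing), read only the files whose date is not already present in the existing DataFrame. The helper name, the `date` column, and the MM-DD-YYYY.csv naming are assumptions for illustration:

import pandas as pd

def append_missing_reports(existing_df, csv_names, base_url):
    # hypothetical helper: read only the files whose date is not yet in existing_df
    known_dates = set(existing_df['date'])
    new_frames = []
    for name in csv_names:                    # e.g. '01-22-2020.csv'
        date = name[:-len('.csv')]            # strip the extension -> '01-22-2020'
        if date not in known_dates:
            df = pd.read_csv(base_url + name).fillna(0)
            df['date'] = date                 # tag the rows with the file's date
            new_frames.append(df)
    if not new_frames:
        return existing_df
    return pd.concat([existing_df] + new_frames, ignore_index=True)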
Geonerd

I'd suggest you use pathlib, as it provides, IMHO, an easier way to deal with files:

from pathlib import Path
import pandas as pd

files = Path('/path/to/folder/containing/files')
# filter for only csv files
csv_only = files.rglob('*.csv')
# read your csv files using a list comprehension
# you can attach the filename (here the report date) if it is relevant and makes sense,
# using the stem attribute from pathlib
combo = [pd.read_csv(f)
         .assign(date=f.stem)
         .fillna(0)
         for f in csv_only]

# you can lump them all into one dataframe, using pandas' concat function:

one_df = pd.concat(combo, ignore_index=True)

# you can remove duplicates:

one_df = one_df.drop_duplicates('date')
sammywemmy