
I'm trying to do two things here:

  1. Import all the .csv files and combine them into a single df.
  2. Update the df with the latest file uploaded.

I have been able to import one .csv with:

import pandas as pd
url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv' 
pd.read_csv(url).fillna(0)

I could import all the .csv files one by one (or with a loop if I knew how to extract all the .csv filenames), but there should be a more efficient way. Once I have the df, to "update" it I would:

  1. Extract all the .csv filenames.
  2. Check if all of them are in the df (with the date column). If one is missing, add the missing .csv file to the df.

The problems I'm having are: (a) how can I make the way I extract all the .csv files scalable? and (b) is there any way to extract ONLY the filenames that end with .csv from the GitHub folder, so that I can do step (2) above?
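For reference, a minimal sketch of one way to list only the `.csv` filenames in that folder on the web, using the GitHub contents API (this assumes the `requests` library and the repo path taken from the raw URL above; the contents API returns at most 1,000 entries per directory, so a very large folder may need the git trees API instead):

import requests

# GitHub contents API endpoint for the daily-reports folder
api_url = ('https://api.github.com/repos/CSSEGISandData/COVID-19/'
           'contents/csse_covid_19_data/csse_covid_19_daily_reports')

entries = requests.get(api_url).json()

# keep only the entries whose name ends with .csv (filenames look like 'MM-DD-YYYY.csv')
csv_names = [entry['name'] for entry in entries if entry['name'].endswith('.csv')]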

Chris
  • Does this answer your question? [Import multiple csv files into pandas and concatenate into one DataFrame](https://stackoverflow.com/questions/20906474/import-multiple-csv-files-into-pandas-and-concatenate-into-one-dataframe) – MShakeG Apr 05 '20 at 12:46
  • Nope, because the files are on the web. The main problem is extracting the filenames ending with `.csv` from the folder on the web. All the solutions below assume the folder is local, when it's on the web. – Chris Apr 05 '20 at 18:30
  • I too am trying to solve this mystery – Geonerd Dec 22 '20 at 17:42

3 Answers


You can list all the csv files like this:

import glob

csvfiles = glob.glob("/path/to/folder/*.csv")

Once you have all the csv file paths, you can loop over them, read each one into a df, and check whether a column is missing, or whatever else you need.
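A minimal sketch of that loop, assuming pandas is imported as `pd` and the files sit under a local folder:

import glob
import pandas as pd

csvfiles = glob.glob("/path/to/folder/*.csv")

dfs = []
for path in csvfiles:
    # read each file and fill missing values, as in the question
    dfs.append(pd.read_csv(path).fillna(0))

# stack everything into one DataFrame
combined = pd.concat(dfs, ignore_index=True)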

Binh

I am still trying to find a better solution, but below is a workaround that I use in my code to pull from a GitHub directory. Unfortunately, I still have not found a way to just get a list of the CSVs in the GitHub directory the way you can when it is on a local drive.

import pandas as pd

def read_multi_csv(start_year, end_year):
    years = list(range(start_year, end_year + 1))
    dfs = []
    for YYYY in years:
        # build the raw URL for each year's file
        file = 'https://raw.githubusercontent.com/username/project/main/data/normalized/' + str(YYYY) + '_crimes_byState.csv'
        #print (file)
        df = pd.read_csv(file)
        dfs.append(df)
    all_dfs = pd.concat(dfs)
    return all_dfs

read_multi_csv(2013,2019)
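Building on the same pattern, a rough sketch of the "update" step from the question: given a list of filenames (for example from a directory listing), read only the files whose date is not already present in the existing DataFrame. The helper name, the `date` column, and the MM-DD-YYYY.csv naming are assumptions for illustration:

import pandas as pd

def append_missing_reports(existing_df, csv_names, base_url):
    # hypothetical helper: read only the files whose date is not yet in existing_df
    known_dates = set(existing_df['date'])
    new_frames = []
    for name in csv_names:                    # e.g. '01-22-2020.csv'
        date = name[:-len('.csv')]            # strip the extension -> '01-22-2020'
        if date not in known_dates:
            df = pd.read_csv(base_url + name).fillna(0)
            df['date'] = date                 # tag the rows with the file's date
            new_frames.append(df)
    if not new_frames:
        return existing_df
    return pd.concat([existing_df] + new_frames, ignore_index=True)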
Geonerd

I'd suggest you use pathlib, as it provides, IMHO, an easier way to deal with files:

from pathlib import Path
import pandas as pd

files = Path('/path/to/folder/containing/files')
# filter for only csv files
csv_only = files.rglob('*.csv')
# read your csv files using a list comprehension
# you can attach the filename (here the report date) if it is relevant and makes sense,
# using the stem attribute from pathlib
combo = [pd.read_csv(f)
         .assign(date=f.stem)
         .fillna(0)
         for f in csv_only]

# you can lump them all into one dataframe, using pandas' concat function:

one_df = pd.concat(combo, ignore_index=True)

# you can remove duplicates:

one_df = one_df.drop_duplicates('date')
sammywemmy