
I'm working on a project, and so far I've made it work by downloading the whole data folder from GitHub to my local machine and post-processing it from there. I'd like to post-process the files directly from GitHub so that I always have up-to-date information. Is there a way to get the folder I need into my script without manually downloading it?

Here's the repo. I need the entire data folder, which is made up of YAML files: https://github.com/openstates/people/tree/main/data

Here's how I'm currently extracting the files from my local copy.

import os
from multiprocessing.pool import ThreadPool

import yaml


def extractingOfficials(file):
    # Parse one YAML file; print the error and return None if it is malformed
    with open(file, 'r') as f:
        try:
            return yaml.safe_load(f)
        except yaml.YAMLError as exc:
            print(exc)
            return None


def mainExtractingFunction(rootdir):
    all_yml_files = []
    for subdir, dirs, files in os.walk(rootdir):
        # skip retired officials and committee offices
        if "retired" in subdir or "committees" in subdir:
            continue
        for file in files:
            if file.endswith(".yml") and "municipalities" not in file:
                all_yml_files.append(os.path.join(subdir, file))

    # parse the files in parallel; the with block cleans up the pool on exit
    with ThreadPool(28) as pool:
        allofficials = pool.map(extractingOfficials, all_yml_files)
    return allofficials
Anthon
Jimmy C
  • Why scrape when you can use `git pull`? – Panagiotis Kanavos Aug 09 '23 at 17:25
  • @PanagiotisKanavos I need to get all the yaml files from the repo together into one dataframe, within one script. Sorry, I'm new to git stuff, can git pull help me achieve that? – Jimmy C Aug 09 '23 at 17:34
  • Check [How to pull specific directory with git](https://stackoverflow.com/questions/2425059/how-to-pull-specific-directory-with-git) – Panagiotis Kanavos Aug 09 '23 at 17:36
  • Beyond Git, Github has a [REST API](https://docs.github.com/en/rest/repos/contents?apiVersion=2022-11-28) that can be used to retrieve files or folders – Panagiotis Kanavos Aug 09 '23 at 17:38
  • Just `git clone` the whole directory tree. Now, all of those files are local to your computer, and you can play with them as you need to get them in memory. – Tim Roberts Aug 09 '23 at 17:40
  • Try: https://minhaskamal.github.io/DownGit/#/home?url=https://github.com/openstates/people/tree/main/data – JonSG Aug 09 '23 at 18:09
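
One way to act on the suggestions above without a local checkout might be to fetch the repository as a single archive and read the YAML files from memory. The following is only a rough sketch under some assumptions: that GitHub's branch-archive URL for `main` is reachable, and that the archive wraps everything in one top-level directory. The names `ARCHIVE_URL`, `download_archive` and `iter_yaml_texts` are hypothetical helpers, not part of any library. The filters mirror the ones in the question's `mainExtractingFunction`.

```python
import io
import tarfile
import urllib.request

# Assumption: GitHub's branch-archive URL for the repo named in the question.
ARCHIVE_URL = "https://github.com/openstates/people/archive/refs/heads/main.tar.gz"


def download_archive(url=ARCHIVE_URL):
    """Fetch the gzipped tarball of the repository into memory (needs network)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def iter_yaml_texts(archive_bytes, folder="data"):
    """Yield (path, text) for each wanted .yml file under `folder`."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Drop the single top-level directory the archive is assumed to have.
            parts = member.name.split("/")[1:]
            if not parts or parts[0] != folder:
                continue
            subdir, filename = "/".join(parts[:-1]), parts[-1]
            # Same filters as the local version in the question.
            if "retired" in subdir or "committees" in subdir:
                continue
            if filename.endswith(".yml") and "municipalities" not in filename:
                text = tar.extractfile(member).read().decode("utf-8")
                yield "/".join(parts), text
```

Each yielded `text` could then be passed to `yaml.safe_load`, for example `[yaml.safe_load(text) for _, text in iter_yaml_texts(download_archive())]`, which should give the same kind of list that `mainExtractingFunction` returns. This downloads the whole repository on every run, so for frequent runs a `git clone` followed by `git pull` is probably cheaper.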

0 Answers