A few weeks back I was faced with a challenge: using web scraping, get all the files of a GitHub repository, group them by extension, and sum the file sizes for each extension. The important constraint was that we should NOT use GitHub's API nor any web-scraping library.
My solution was to fetch the main HTML page as a string and apply a regex to extract all URLs containing <repo_owner>/<repo_name>/blob and <repo_owner>/<repo_name>/tree. For each blob URL, we could make another request and apply another regex to extract the file size and line count; for each tree URL, we'd make another request to extract more blob and tree URLs. I repeated this until there were no tree URLs left, as sketched below.
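To make the idea concrete, here is a minimal sketch of that crawl, using only the standard library (no scraping framework, no API). The function names (`fetch`, `crawl`) and the regex patterns are illustrative assumptions on my part; the exact HTML GitHub serves changes over time, so the patterns would need adjusting.

```python
# Rough sketch of the crawl described above. The regexes are assumptions,
# not the exact patterns GitHub's current HTML would require.
import re
import time
import urllib.request
from collections import defaultdict, deque

BASE = "https://github.com"

def fetch(url: str) -> str:
    """Download a page as text with a plain HTTP GET (no API calls)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def crawl(owner: str, repo: str, delay: float = 1.0) -> dict:
    """Walk tree pages, visit every blob page, and sum sizes per extension."""
    tree_re = re.compile(rf'href="(/{owner}/{repo}/tree/[^"]+)"')
    blob_re = re.compile(rf'href="(/{owner}/{repo}/blob/[^"]+)"')
    # Hypothetical pattern for the "123 Bytes" / "4.5 KB" label on a blob page.
    size_re = re.compile(r"([\d.]+)\s*(Bytes|KB|MB)")
    units = {"Bytes": 1, "KB": 1024, "MB": 1024 ** 2}

    sizes = defaultdict(int)
    seen_trees, seen_blobs = set(), set()
    queue = deque([f"{BASE}/{owner}/{repo}"])

    while queue:                      # keep going until no tree URLs are left
        page = fetch(queue.popleft())
        time.sleep(delay)             # naive throttling between requests

        for path in tree_re.findall(page):
            if path not in seen_trees:
                seen_trees.add(path)
                queue.append(BASE + path)

        for path in blob_re.findall(page):
            if path in seen_blobs:
                continue
            seen_blobs.add(path)
            blob_page = fetch(BASE + path)
            time.sleep(delay)
            m = size_re.search(blob_page)
            if m:
                ext = path.rsplit(".", 1)[-1] if "." in path else "(none)"
                sizes[ext] += int(float(m.group(1)) * units[m.group(2)])
    return dict(sizes)
```

The problem with this shape is visible right in the sketch: every blob costs one extra request, so a repository with thousands of files means thousands of requests.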
It solved the problem, but it was a pretty bad solution: it requires far too many requests to GitHub, and at some point we always got blocked while analyzing a repository. I added a delay between requests, but then processing a single repository takes a very long time; and even if we made, say, 10 requests concurrently, we'd still run into the too-many-requests problem.
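For completeness, the throttling I used was just a fixed sleep. A slightly smarter variant (not what I actually did, just a sketch) would back off exponentially whenever GitHub starts rejecting requests, though it only slows the inevitable:

```python
# Minimal sketch of retrying with exponential backoff when GitHub starts
# rejecting requests (e.g. HTTP 429). Purely illustrative.
import time
import urllib.error
import urllib.request

def fetch_with_backoff(url: str, max_retries: int = 5) -> str:
    delay = 1.0
    for _ in range(max_retries):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read().decode("utf-8", errors="replace")
        except urllib.error.HTTPError as err:
            if err.code not in (429, 403):   # only back off on rate limiting
                raise
            time.sleep(delay)                # wait longer after each rejection
            delay *= 2
    raise RuntimeError(f"still rate-limited after {max_retries} retries: {url}")
```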
This still bothers me today because I couldn't find a better solution. Since the challenge is no longer active, I'd love to hear other people's ideas on how it could be solved!