
A few weeks back I was given a challenge that boiled down to webscraping all the files of a GitHub repository, grouping them by extension, and summing the file sizes per extension. The important constraint was that we SHOULD NOT use GitHub's API nor any webscraping tool.

My solution was to get the main HTML page as a string and apply a regex to extract all URLs containing <repo_owner>/<repo_name>/blob and <repo_owner>/<repo_name>/tree. For each blob URL, I made another request and applied another regex to extract the file size and line count; for each tree URL, I made another request to extract more blob and tree URLs. I repeated this until there were no tree URLs left.
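Roughly, the approach looked like this (a simplified sketch rather than my actual code; the blob/tree regexes follow what I described, but the size pattern is illustrative and depends on GitHub's HTML at the time):

```python
import re
import time
import urllib.request
from collections import defaultdict

BASE = "https://github.com"

def fetch(url):
    # Plain HTTP GET, returning the page as a string (no scraping library).
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def crawl(owner, repo):
    tree_re = re.compile(rf'href="(/{owner}/{repo}/tree/[^"]+)"')
    blob_re = re.compile(rf'href="(/{owner}/{repo}/blob/[^"]+)"')
    size_re = re.compile(r'([\d.]+)\s*(Bytes|KB|MB)')  # hypothetical pattern

    totals = defaultdict(float)           # extension -> accumulated size
    pending_trees = [f"/{owner}/{repo}"]  # start at the repository root
    seen = set()

    while pending_trees:
        html = fetch(BASE + pending_trees.pop())
        for tree in tree_re.findall(html):
            if tree not in seen:
                seen.add(tree)
                pending_trees.append(tree)
        for blob in blob_re.findall(html):
            if blob in seen:
                continue
            seen.add(blob)
            blob_html = fetch(BASE + blob)
            match = size_re.search(blob_html)
            if match:
                name = blob.rsplit("/", 1)[-1]
                ext = name.rsplit(".", 1)[-1] if "." in name else "(no extension)"
                totals[ext] += float(match.group(1))  # unit conversion omitted
            time.sleep(1)  # the delay I added to avoid getting blocked so fast
    return totals
```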

It solved the problem, but it was a pretty bad solution: it requires far too many requests to GitHub, and we always get blocked at some point while analyzing a repository. I added a delay between requests, but then it takes a "LOOOT" of time to process a single repository; and even if I make, say, 10 requests simultaneously, I still run into the same too-many-requests problem.
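For context, by "10 requests simultaneously" I mean bounded parallelism along these lines (a minimal standard-library sketch; the worker count and delay are arbitrary, and GitHub's rate limiting still kicks in eventually):

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Plain HTTP GET returning the page body as a string.
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def fetch_all(urls, max_workers=10, delay=0.5):
    # At most `max_workers` requests in flight, with a pause between
    # submissions to soften the burst. This only slows the block down,
    # it does not prevent it.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {}
        for url in urls:
            futures[url] = pool.submit(fetch, url)
            time.sleep(delay)
        return {url: fut.result() for url, fut in futures.items()}
```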

This still bothers me because I couldn't find a better solution. Since the challenge is no longer active, I'd like to hear other people's ideas on how this could be solved!

Wall-E
  • Have you considered an HTML parser like BeautifulSoup? (I'm assuming you know Python here.) https://www.crummy.com/software/BeautifulSoup/bs4/doc/. It's a pretty easy library to pick up. – AaronS Jul 16 '20 at 20:28
  • Sorry, I forgot to mention that we couldn't use any webscraping tool. But I think that even with this I would fall into the same problem I encountered with my solution. – Wall-E Jul 16 '20 at 20:39
  • It's an HTML parser, not a webscraping tool. It parses and breaks the tags down into a tree of objects, which lets you look up and access information across the page either hierarchically or with queries, just like in jQuery. – Havenard Jul 16 '20 at 20:48
  • @Havenard, you're right, but this still has the same problems that my implementation has. – Wall-E Jul 16 '20 at 21:40
  • Well there are only two evident problems from what I can tell, one is that you're using regular expressions to parse HTML, and [that is forbidden by law](https://stackoverflow.com/a/1732454/156811), hence why this topic was touched. The only other problem I can see is that you might want to do some form of parallel requests instead of doing one request at a time. Perhaps that's the problem you mean? – Havenard Jul 16 '20 at 21:46
  • Yeah so, naturally, GitHub wants to protect themselves from any form of flood attack, including an excessive number of requests coming from a given IP address. This is a cue that you're doing something they don't want you to do, and the obvious solution is: don't use webscraping. This is not a problem you are supposed to work around (even though there are some ways). If you absolutely must use webscraping, then you will have to cope with the limitations they impose for their own good. – Havenard Jul 16 '20 at 21:56
  • Yeah, I totally agree. This was a job challenge, and the interviewer told me that someone else solved it by analyzing a repository in just a few seconds, and one of the requirements was that it support thousands of concurrent requests. What bugs me now is how that was possible. I'm pretty shaken up by it! – Wall-E Jul 16 '20 at 21:59
  • Maybe he downloaded the repo as a ZIP and scraped from there? Is that a valid solution? – Havenard Jul 16 '20 at 22:04
  • That, or he wrote code that makes parallel requests just so they could see he knows his stuff, even though it didn't actually work because of GitHub's filters; or he limited the number of parallel requests to whatever GitHub accepts. – Havenard Jul 16 '20 at 22:06
  • I actually thought of downloading the ZIP, but at the time I figured they wanted more of a "do it all on the web" solution; downloading the ZIP felt more like working with files. At the interview they told me it would have been valid too (see the sketch after these comments). – Wall-E Jul 16 '20 at 22:20
  • @Havenard, I was looking at GitHub and it looks like it has a file-finder endpoint that lists all files, but I still need the file size. Any ideas? – Wall-E Aug 24 '20 at 14:37
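For what it's worth, here is a rough sketch of the ZIP approach Havenard suggested: one request downloads the archive, and the per-file sizes come straight from the ZIP index, with no HTML parsing at all (the branch name is assumed to be master, and octocat/Hello-World is just an example repository):

```python
import io
import urllib.request
import zipfile
from collections import defaultdict

def sizes_by_extension(owner, repo, branch="master"):
    # GitHub serves a zipball of a branch at this URL (branch name assumed).
    url = f"https://github.com/{owner}/{repo}/archive/refs/heads/{branch}.zip"
    with urllib.request.urlopen(url) as resp:
        data = resp.read()

    totals = defaultdict(int)  # extension -> total uncompressed size in bytes
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            name = info.filename.rsplit("/", 1)[-1]
            ext = name.rsplit(".", 1)[-1] if "." in name else "(no extension)"
            totals[ext] += info.file_size
    return dict(totals)

if __name__ == "__main__":
    for ext, size in sorted(sizes_by_extension("octocat", "Hello-World").items()):
        print(f"{ext}: {size} bytes")
```

The sizes are the uncompressed file sizes stored in the archive's central directory, so no extra request per file is needed.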

0 Answers