42

Let's say there's a file that lives in this GitHub repo:

https://github.com/someguy/brilliant/blob/master/somefile.txt

I'm trying to use requests to fetch this file and write its contents to disk in the current working directory, where it can be used later. Right now, I'm using the following code:

import requests
from os import getcwd

url = "https://github.com/someguy/brilliant/blob/master/somefile.txt"
directory = getcwd()
filename = directory + 'somefile.txt'
r = requests.get(url)

f = open(filename,'w')
f.write(r.content)

Undoubtedly ugly, and more importantly, not working. Instead of the expected text, I get:

<!DOCTYPE html>
<!--

Hello future GitHubber! I bet you're here to remove those nasty inline styles,
DRY up these templates and make 'em nice and re-usable, right?

Please, don't. https://github.com/styleguide/templates/2.0

-->
<html>
  <head>
    <meta http-equiv="Content-type" content="text/html; charset=utf-8">
    <title>Page not found &middot; GitHub</title>
    <style type="text/css" media="screen">
      body {
        background: #f1f1f1;
        font-family: "HelveticaNeue", Helvetica, Arial, sans-serif;
        text-rendering: optimizeLegibility;
        margin: 0; }

      .container { margin: 50px auto 40px auto; width: 600px; text-align: center; }

      a { color: #4183c4; text-decoration: none; }
      a:visited { color: #4183c4 }
      a:hover { text-decoration: none; }

      h1 { letter-spacing: -1px; line-height: 60px; font-size: 60px; font-weight: 100; margin: 0px; text-shadow: 0 1px 0 #fff; }
      p { color: rgba(0, 0, 0, 0.5); margin: 20px 0 40px; }

      ul { list-style: none; margin: 25px 0; padding: 0; }
      li { display: table-cell; font-weight: bold; width: 1%; }
      #error-suggestions { font-size: 14px; }
      #next-steps { margin: 25px 0 50px 0;}
      #next-steps li { display: block; width: 100%; text-align: center; padding: 5px 0; font-weight: normal; color: rgba(0, 0, 0, 0.5); }
      #next-steps a { font-weight: bold; }
      .divider { border-top: 1px solid #d5d5d5; border-bottom: 1px solid #fafafa;}

      #parallax_wrapper {
        position: relative;
        z-index: 0;
      }
      #parallax_field {
        overflow: hidden;
        position: absolute;
        left: 0;
        top: 0;
        height: 370px;
        width: 100%;
      }

etc etc.

That's content from GitHub, but not the content of the file. What am I doing wrong?

Fomite
  • 2
    You should really use `os.path.join()` to combine paths. `getcwd()` does not necessarily return a directory name ending in a path separator. – Martijn Pieters Jan 02 '13 at 10:38

4 Answers

42

The content of the file in question is included in the returned data. You are getting the full GitHub view of that file, not just the contents.

If you want to download just the file, you need to use the Raw link at the top of the page, which will be (for your example):

https://raw.githubusercontent.com/someguy/brilliant/master/somefile.txt

Note the change in domain name, and the blob/ part of the path is gone.

To demonstrate this with the requests GitHub repository itself:

>>> import requests
>>> r = requests.get('https://github.com/kennethreitz/requests/blob/master/README.rst')
>>> 'Requests:' in r.text
True
>>> r.headers['Content-Type']
'text/html; charset=utf-8'
>>> r = requests.get('https://raw.githubusercontent.com/kennethreitz/requests/master/README.rst')
>>> 'Requests:' in r.text
True
>>> r.headers['Content-Type']
'text/plain; charset=utf-8'
>>> print(r.text)
Requests: HTTP for Humans
=========================


.. image:: https://travis-ci.org/kennethreitz/requests.png?branch=master
[... etc. ...]
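
The URL change described above can also be done in code. This is only a minimal sketch, assuming the usual `https://github.com/<user>/<repo>/blob/<branch>/<path>` layout; the helper name `blob_to_raw` is made up for illustration and is not part of requests or of any GitHub API.

```python
from urllib.parse import urlsplit, urlunsplit


def blob_to_raw(url):
    """Turn a github.com 'blob' page URL into the matching raw-file URL.

    Hypothetical helper for illustration; assumes the path looks like
    /<user>/<repo>/blob/<branch>/<path>.
    """
    parts = urlsplit(url)
    segments = parts.path.lstrip("/").split("/")
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError("not a GitHub blob URL: %r" % url)
    # Drop the 'blob' segment and switch to the raw-content host.
    raw_path = "/" + "/".join(segments[:2] + segments[3:])
    return urlunsplit(("https", "raw.githubusercontent.com", raw_path, "", ""))


print(blob_to_raw("https://github.com/someguy/brilliant/blob/master/somefile.txt"))
# https://raw.githubusercontent.com/someguy/brilliant/master/somefile.txt
```

The returned string can then be passed to `requests.get()` in place of the original page URL.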
Pranav Hosangadi
Martijn Pieters
  • 3
    If you want to access a file in a private repository, basic authentication works just fine: `requests.get('https://raw.github.com/myfile.txt', auth=('username', 'passwd'))` – linqu Nov 08 '13 at 16:48
  • 1
    Just a note that the raw link is served with a cache time of 5 minutes, so a change to the file can take up to 5 minutes to show up in the raw link. – FossilBlade Dec 07 '21 at 21:54
14

You need to request the raw version of the file, from https://raw.githubusercontent.com.

See the difference:

https://raw.githubusercontent.com/django/django/master/setup.py vs. https://github.com/django/django/blob/master/setup.py

Also, you should probably add a / between your directory and the filename:

>>> getcwd()+'foo.txt'
'/Users/burhanfoo.txt'
>>> import os
>>> os.path.join(getcwd(),'foo.txt')
'/Users/burhan/foo.txt'
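
Putting the path fix together with the write step, here is a rough sketch of saving the downloaded bytes into the current working directory. The function name `save_bytes` and its parameters are my own, chosen for illustration. It opens the file in binary mode because `r.content` is `bytes`, and on Python 3 writing bytes to a file opened with `'w'` raises a `TypeError`.

```python
import os


def save_bytes(data, filename, directory=None):
    """Write bytes to <directory>/<filename> and return the full path.

    Hypothetical helper; directory defaults to the current working directory.
    """
    if directory is None:
        directory = os.getcwd()
    # os.path.join inserts the path separator that plain '+' leaves out.
    path = os.path.join(directory, filename)
    # Binary mode, since the payload is bytes rather than str.
    with open(path, "wb") as f:
        f.write(data)
    return path
```

With a response `r` from `requests.get(raw_url)`, this would be called as `save_bytes(r.content, 'somefile.txt')`.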
Pranav Hosangadi
Burhan Khalid
9

Just as an update, https://raw.github.com was migrated to https://raw.githubusercontent.com. So the general format is:

url = "https://raw.githubusercontent.com/user/repo/branch/[subfolders]/file"

E.g. https://raw.githubusercontent.com/earnestt1234/seedir/master/setup.py. Still use requests.get(url) as in Martijn's answer.
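
If you build these URLs often, the general format above could be wrapped in a small function. This is just a sketch: the name `raw_url` and its arguments are hypothetical, and it assumes the `user/repo/branch/path` layout shown above. `urllib.parse.quote` leaves `/` alone by default, so subfolders pass through while characters such as spaces get percent-encoded.

```python
from urllib.parse import quote


def raw_url(user, repo, branch, path):
    """Build a raw.githubusercontent.com URL from its parts.

    Hypothetical helper; path may include subfolders, e.g. 'docs/index.rst'.
    """
    return "https://raw.githubusercontent.com/{}/{}/{}/{}".format(
        quote(user), quote(repo), quote(branch), quote(path.lstrip("/"))
    )


print(raw_url("earnestt1234", "seedir", "master", "setup.py"))
# https://raw.githubusercontent.com/earnestt1234/seedir/master/setup.py
```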

Tom
2

Adding a working example ready for copy+paste:

import requests
from requests.structures import CaseInsensitiveDict

url = "https://raw.githubusercontent.com/organization/repo/branch/folder/file"

# If repo is private - we need to add a token in header:
headers = CaseInsensitiveDict()
headers["Authorization"] = "token TOKEN"

resp = requests.get(url, headers=headers)
print(resp.status_code)

(*) If the repo is not private, remove the headers part.


Bonus:
Check out this curl <-> Python-requests online converter.

Rot-man