
I believe the recommended method to get the contents of a large file stored on GitHub is to use the REST API. For files between 1 MB and 100 MB in size, it's only possible to get the raw contents (as a string).

I need to use this content to write into a file. If I use the pygithub package, I get exactly what I need (bytes, and the response object contains an encoding field whose value is base64). Unfortunately, this package does not work for files larger than 1 MB.

So it seems that I only need to find the correct way to convert the string to bytes. There are many ways to do it; I have tried 4 so far, and none matches the output of the pygithub package. See the output for several guinea-pig files below. How do I do the conversion correctly?

from github import Github, ContentFile
import requests
from requests.structures import CaseInsensitiveDict
import base64

token = ...
repo_name = ...
owner = ...
filename = ...
   
# pygithub method
github_object = Github(token)
github_user = github_object.get_user()
repo = github_user.get_repo(repo_name)
cont_obj = repo.get_contents(filename)
print('encoding', cont_obj.encoding) # prints base64
content_ref = cont_obj.decoded_content # this works correctly for <1MB files

#REST API method
url = f"https://api.github.com/repos/{owner}/{repo_name}/contents/{filename}"

headers = CaseInsensitiveDict()
headers["Accept"] = "application/vnd.github.v3.raw"
headers["Authorization"] = f"Bearer {token}"
headers["X-GitHub-Api-Version"] = "2022-11-28"
contents_str = requests.get(url, headers=headers).text

contents = []

# method 0: decode the raw string as base64 (with extra padding)
# https://stackoverflow.com/questions/72037211/how-to-convert-a-base64-file-to-bytes
contents.append(base64.b64decode(contents_str.encode() + b'=='))

# method 1: re-encode the string using raw_unicode_escape
# https://stackoverflow.com/questions/7585435/best-way-to-convert-string-to-bytes-in-python-3
contents.append(bytes(contents_str, encoding="raw_unicode_escape"))

# method 2: encode the string as UTF-8, then base64-encode it
# https://stackoverflow.com/questions/7585435/best-way-to-convert-string-to-bytes-in-python-3
message_bytes = contents_str.encode('utf-8')
contents.append(base64.b64encode(message_bytes))
#contents.append(base64.decodebytes(message_bytes + b'==')) # same as method 0

print(type(content_ref), len(content_ref), content_ref[:50])
for i, c in enumerate(contents):
  print(i, type(c), len(c), c[:50])

The output for the guinea-pig files:

  • A text file containing tiny text shows that all methods except method 1 are incorrect:

    <class 'bytes'> 10 b'tiny text\n'

    0 <class 'bytes'> 6 b'\xb6)\xf2\xb5\xecm'

    1 <class 'bytes'> 10 b'tiny text\n'

    2 <class 'bytes'> 16 b'dGlueSB0ZXh0Cg=='

  • For this PDF file, the length of the output of method 1 is slightly bigger, and the contents are slightly different:

    <class 'bytes'> 3028 b'%PDF-1.3\r\n%\xe2\xe3\xcf\xd3\r\n\r\n1 0 obj\r\n<<\r\n/Type /Catalog\r\n/O'

    0 <class 'bytes'> 1504 b'<1u\xdf](n?\xd3\xca\x97\xbf\t\xabZ\x96\x88?:\xebe\x8aw\xac\xdbD\x7f=\xa8\x1e\xb3}\x11zwhn=\xb4\xa1\xb8\xffO*^\xfc\xeb\xad\x96)'

    1 <class 'bytes'> 3048 b'%PDF-1.3\r\n%\ufffd\ufffd\ufffd\ufffd\r\n\r\n1 0 obj\r\n<<'

    2 <class 'bytes'> 4048 b'JVBERi0xLjMNCiXvv73vv73vv73vv70NCg0KMSAwIG9iag0KPD'

  • For this image, the size and the contents of the output of method 1 are very different:

    <class 'bytes'> 57270 b'GIF89a\xfa\x00)\x01\xe7\xff\x00\x06\t\r\x0f\n\x08\x19\r\x0c \x0e\n,\x12\x0b"\x18\x17\x1f\x1a\x16"\x1b\x12&\x18\x18&!\x17%!\x1b* \x1d,'

    0 <class 'bytes'> 329 b'\x18\x81|\xf5\xa17\xebo\xb6\xf3^\x1b\x03]\x02\xdb\xcd\x04\xfc\x8e\xc7\xd8\xb1D\x0c\xa36\xe8\xd3\x00\xbf\x9e\x94\xf5$\xbcT\x04D\xf9\x11\xfa\_U\x14\x05\xd5\xfce'

    1 <class 'bytes'> 169079 b'GIF89a\ufffd\x00)\x01\ufffd\ufffd\x00\x06\t\r\x0f\n\x08\x19\r\x0c \x0e\n,\x12\x0b"\x18\x17\x1f\x1a\x16"'

    2 <class 'bytes'> 132656 b'R0lGODlh77+9ACkB77+977+9AAYJDQ8KCBkNDCAOCiwSCyIYFx'

Yulia V
  • In the posted example, `contents_str` will not contain the response string but a response object. You have to read `contents_str.text` or `contents_str.raw` to get the data. Does this solve your problem, or is it just a copy-paste error? – Cube Jul 05 '23 at 10:01
  • @Cube, a copy-paste error, I have omitted `.text` – Yulia V Jul 05 '23 at 10:03
  • Too bad, that would have been a nice and easy answer XD – Cube Jul 05 '23 at 10:05
  • @Cube of course XD – Yulia V Jul 05 '23 at 10:06
  • Have you considered streaming the content? – DarkKnight Jul 05 '23 at 10:10
  • @DarkKnight I am trying to get the basic case to work first - would streaming help? If it falls into the improvements category, I am also interested to hear what you suggest, although it's not a priority – Yulia V Jul 05 '23 at 10:14

3 Answers


You can use the `content` attribute of the requests response to get the content as pure bytes. This way you get bytes you can write straight to a file, as long as the file was opened in binary mode.

The following code is a very simple example using a .png file from one of my public repos:

import requests

url = 'https://raw.githubusercontent.com/euribates/notes/main/docs/blender/developers-preference.png'
content_in_bytes = requests.get(url).content
assert type(content_in_bytes) is bytes
with open('image.png', 'wb') as f_out:
    f_out.write(content_in_bytes)

Please note that GitHub uses a different host name (raw.githubusercontent.com) to serve the raw content.
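For a file that is accessible without authentication, that raw URL can be built from the owner, repo, branch, and path, mirroring the API URL in the question. A minimal sketch (the `branch` value here is an assumption, e.g. the default `main`):

import requests

owner = 'euribates'
repo = 'notes'
branch = 'main'   # assumed branch name
path = 'docs/blender/developers-preference.png'

raw_url = f'https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path}'
content_in_bytes = requests.get(raw_url).content  # raw bytes, no base64 involved
with open('image.png', 'wb') as f_out:
    f_out.write(content_in_bytes)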

Hope this helps.

Euribates

Assuming your URL is appropriate, this is how you could stream the content to a local file:

import requests
# assign the following variables as appropriate
owner = 'owner'
repo = 'repo'
filename = 'filename'
token = 'token'
output_file = f'/Users/Yulia/Downloads/{filename}'

HEADERS = {
    'Accept': 'application/vnd.github.v3.raw',
    'Authorization': f'Bearer {token}',
    'X-GitHub-Api-Version': '2022-11-28'
}
CHUNK = 16*1024

with requests.get(f'https://api.github.com/repos/{owner}/{repo}/contents/{filename}', headers=HEADERS, stream=True) as response:
    response.raise_for_status()
    with open(output_file, 'wb') as output:
        for chunk in response.iter_content(CHUNK):
            output.write(chunk)

If you don't want to stream the content then it's just:

with requests.get(f'https://api.github.com/repos/{owner}/{repo}/contents/{filename}', headers=HEADERS) as response:
    response.raise_for_status()
    with open(output_file, 'wb') as output:
        output.write(response.content)
DarkKnight
  • Thank you. Could you explain the added value of getting the contents by chunks if I run it as an automated task? In practice, it would be copying from one cloud storage (SFTP server) to another cloud storage (GitHub, pCloud etc.), or in the opposite direction, as an automated task, so my personal internet quality is irrelevant, I presume. For my use case, the added value of chunking is not obvious to me, but I am not an expert; maybe there is one. – Yulia V Jul 05 '23 at 10:33
  • @YuliaV If the file(s) being downloaded are very large, streaming can reduce the impact on the client's memory requirements. If that's not going to be a problem then just open the local file ('wb' mode) and write response.content to it in a single step – DarkKnight Jul 05 '23 at 10:35

Like others already said, requests.get(...).content gets you the raw bytes, which is what you need. Using this, your test returns what's expected:

from github import Github, ContentFile
import requests
from requests.structures import CaseInsensitiveDict
import base64

token = "..."

owner = "..."
repo_name = "test"
filename= "test.file"



# pygithub method
github_object = Github(token)
github_user = github_object.get_user()
repo = github_user.get_repo(repo_name)
cont_obj = repo.get_contents(filename)

content_ref = cont_obj.decoded_content # this works correctly for <1MB files


#REST API method
url = f"https://api.github.com/repos/{owner}/{repo_name}/contents/{filename}"
print(url)

headers = CaseInsensitiveDict()
headers["Accept"] = "application/vnd.github.v3.raw"
headers["Authorization"] = f"Bearer {token}"
headers["X-GitHub-Api-Version"] = "2022-11-28"
content_req = requests.get(url, headers=headers).content


print(type(content_ref), len(content_ref), content_ref[:50])
print(type(content_req), len(content_req), content_req[:50])

print(cont_obj.content)
print(base64.b64encode(content_req))

The code returns:

<class 'bytes'> 25 b'hello World\n\nadded stuff\n'
<class 'bytes'> 25 b'hello World\n\nadded stuff\n'
aGVsbG8gV29ybGQKCmFkZGVkIHN0dWZmCg==

b'aGVsbG8gV29ybGQKCmFkZGVkIHN0dWZmCg=='

This shows that both methods produce identical results, and that using base64.b64encode() on the requests content reproduces the same base64 encoding, just as a byte string.
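As a sanity check, a small sketch that reuses the variables from the script above; the `.replace('\n', '')` is only there because the API's base64 `content` field may contain newline breaks for larger files:

# both approaches should yield the same raw bytes
assert content_ref == content_req

# re-encoding the requests bytes reproduces the base64 `content` field
# (strip any newlines GitHub may insert into the base64 string)
assert base64.b64encode(content_req).decode() == cont_obj.content.replace('\n', '')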

Hope this helps

Cube