
I have some datasets in a GitHub repo and I want to move them to S3 using Python, without saving anything locally.

This is my public source repo: https://github.com/statsbomb/open-data/tree/master/data

I have gotten boto3 working, but I have to save the file in my workspace before uploading it to S3. This is too much data to download, so I want to move it directly to S3 and then start wrangling the data.

Mido
    The S3 API has no method or function to import data from a git repository. Your best bet is probably to spin up an EC2 instance, clone the repo to it locally, then use the AWS CLI to sync the data you're after up to S3. – Anon Coward Feb 15 '23 at 20:44
  • Perhaps a lambda that uploads based on the file's url on github, a la https://stackoverflow.com/a/53402073/642070 – tdelaney Feb 15 '23 at 20:50
  • With the EC2 option, you could do the git clone and s3 sync from within the instance's launch userdata script followed by EC2 self-termination so it's mostly automated. – jarmod Feb 15 '23 at 21:14
  • Why do you not wish to save anything locally? While you could retrieve the contents of a file to memory and then upload to S3 from memory, there is no benefit in doing so. What problem are you hoping to solve/avoid by not saving locally? – John Rotenstein Feb 16 '23 at 10:24
  • @JohnRotenstein Why would I download locally and then upload to the cloud, consuming both time and memory space, if I could get the data from a public source to the target cloud directly in one upload step? – Mido Feb 23 '23 at 21:54
  • It is not possible to tell Amazon S3 to 'pull' information from another datasource. Even the accepted answer (below) downloads the response from the GET request and then uploads it to Amazon S3. When you say "with one upload process", it still requires the data to be downloaded before it can be uploaded. – John Rotenstein Feb 23 '23 at 22:00
  • I have used something close to the accepted answer, as below: `def upload_to_s3(url_response, file_name):` `s3object = s3.Object(bucket_name, ''.join([folder, file_name]))` `s3object.put(Body=bytes(json.dumps(url_response.json()).encode('UTF-8')))` This way I don't have to save the dataset to disk and then upload it again. However, I am still trying to optimize it, so let me know what you think and which way is better. @JohnRotenstein – Mido Feb 24 '23 at 20:12
  • 1
    While the dataset might not have been saved to disk, it was still "downloaded" to your computer. The only operation saved was writing to disk and reading it from disk again. In some situations (eg high-speed trading) these few milliseconds can be important, but it is a trade-off with potentially limited RAM space. The main thing is that it's working for you, which is great! – John Rotenstein Feb 24 '23 at 21:14
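A minimal sketch of the EC2 route suggested by Anon Coward and jarmod above, using boto3 to launch an instance whose user data clones the repo, syncs data/ to S3 with the AWS CLI, and then self-terminates. The AMI ID, instance profile, bucket name and key prefix are placeholders, and the instance profile is assumed to have write access to the bucket.

import boto3

ec2 = boto3.client('ec2')

# First-boot script: clone the repo, sync data/ to S3, then shut down.
# With InstanceInitiatedShutdownBehavior='terminate' the shutdown also
# terminates the instance. The root volume may need enlarging to hold
# the full clone.
user_data = """#!/bin/bash
yum install -y git
git clone --depth 1 https://github.com/statsbomb/open-data.git /opt/open-data
aws s3 sync /opt/open-data/data s3://your_bucket_name/statsbomb/
shutdown -h now
"""

ec2.run_instances(
    ImageId='ami-xxxxxxxxxxxxxxxxx',                        # placeholder: a current Amazon Linux AMI
    InstanceType='t3.medium',
    MinCount=1,
    MaxCount=1,
    IamInstanceProfile={'Name': 'your-instance-profile'},   # placeholder: needs s3:PutObject on the bucket
    InstanceInitiatedShutdownBehavior='terminate',
    UserData=user_data,                                      # boto3 base64-encodes this for you
)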
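A cleaned-up sketch of the in-memory approach Mido describes in the last comments, assuming a boto3 resource named s3 plus bucket_name and folder placeholders as in that snippet: the JSON payload is held in memory and written straight to the S3 object, never to disk.

import json
import boto3
import requests

s3 = boto3.resource('s3')
bucket_name = 'your_bucket_name'   # placeholder
folder = 'statsbomb/'              # placeholder key prefix

def upload_to_s3(url_response, file_name):
    # Re-serialise the parsed JSON and write it to S3 from memory
    s3_object = s3.Object(bucket_name, ''.join([folder, file_name]))
    s3_object.put(Body=json.dumps(url_response.json()).encode('utf-8'))

# Example: copy competitions.json without touching the local disk
response = requests.get(
    'https://raw.githubusercontent.com/statsbomb/open-data/master/data/competitions.json'
)
upload_to_s3(response, 'competitions.json')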

1 Answer

import requests
import boto3

s3 = boto3.client('s3')
bucket_name = 'your_bucket_name'

# Files to copy, as paths relative to the repo's data/ directory.
# Only competitions.json is a single file at this level; the events,
# matches and lineups folders hold one JSON file per match, so list
# the specific files you need.
datasets = [
    'competitions.json',
    # 'events/<match_id>.json',
    # 'matches/<competition_id>/<season_id>.json',
    # 'lineups/<match_id>.json',
]

# Stream each file from GitHub straight to S3 without writing it to disk
for dataset in datasets:
    url = f'https://github.com/statsbomb/open-data/blob/master/data/{dataset}?raw=true'
    response = requests.get(url, stream=True)
    response.raise_for_status()
    response.raw.decode_content = True  # undo gzip transfer encoding before uploading
    s3.upload_fileobj(response.raw, bucket_name, dataset)
Mrinal
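The answer above copies individual files. In this repo only competitions.json is a single file under data/; the events, matches and lineups folders each hold one JSON file per match, so to copy a whole folder you first need to enumerate it. One way, sketched below as an assumption rather than part of the answer, is the public GitHub contents API (which returns at most 1,000 entries per directory, so the Git Trees API would be needed for the full events/ tree).

import boto3
import requests

s3 = boto3.client('s3')
bucket_name = 'your_bucket_name'   # placeholder

def copy_directory(repo_dir, key_prefix):
    # List the directory via the GitHub contents API
    listing = requests.get(
        f'https://api.github.com/repos/statsbomb/open-data/contents/{repo_dir}?ref=master'
    )
    listing.raise_for_status()
    for entry in listing.json():
        if entry['type'] != 'file':
            continue
        # Stream each file straight from GitHub into S3
        response = requests.get(entry['download_url'], stream=True)
        response.raise_for_status()
        response.raw.decode_content = True   # undo gzip transfer encoding
        s3.upload_fileobj(response.raw, bucket_name, key_prefix + entry['name'])

copy_directory('data/matches/11', 'matches/11/')   # 11 = La Liga in this dataset

Note that unauthenticated GitHub API calls are rate-limited to 60 requests per hour, so a token (or simply cloning the repo on an EC2 instance, as suggested in the comments) is a better fit for copying the whole dataset.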