This is a link to download a zip file containing about 1 GB of postcode-level data published by the UK government's Office for National Statistics: https://www.arcgis.com/sharing/rest/content/items/19fac93960554b5e90840505bd73917f/data
Information on the data can be found here: http://geoportal.statistics.gov.uk/datasets/19fac93960554b5e90840505bd73917f
I have used this data in a data science application in Python, loading it into a Pandas dataframe. I have integrated this into a simple web page and am deploying it to the cloud. I do not want to include the large data in my repository, which I am accessing from an AWS EC2 instance. As I understand it, I therefore have two options:
1) Include the zipped file in the repository and read the CSV into a Pandas dataframe.
2) Open the URL, stream the file in, extract it in the script, and then read the CSV into a Pandas dataframe.
The issue with both of these approaches is that the zip file contains files other than the CSV I need, and I'm not sure how to access that one file specifically.
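For illustration, here is a minimal sketch of how a single member of a zip archive could be read into a dataframe with the standard-library `zipfile` module, covering both a local archive (option 1) and a downloaded one (option 2). The function names are hypothetical, and the member is matched by its base file name because the folder layout inside the ONS archive is an assumption here, not something confirmed above.

```python
import posixpath
import shutil
import tempfile
import urllib.request
import zipfile

import pandas as pd


def read_csv_from_zip(source, filename, **read_csv_kwargs):
    """Read one CSV member of a zip archive into a DataFrame.

    `source` is a path or a seekable binary file object. `filename` is the
    base name of the wanted member; any folder prefix inside the archive
    is ignored when matching.
    """
    with zipfile.ZipFile(source) as zf:
        # Compare base names so 'some/folder/x.csv' matches 'x.csv', while
        # macOS metadata such as '__MACOSX/._x.csv' does not.
        matches = [n for n in zf.namelist() if posixpath.basename(n) == filename]
        if len(matches) != 1:
            raise ValueError(f"expected one member named {filename!r}, found {matches}")
        with zf.open(matches[0]) as fh:
            return pd.read_csv(fh, **read_csv_kwargs)


def read_csv_from_zip_url(url, filename, **read_csv_kwargs):
    """Download a zip archive to a temporary file, then read one CSV from it."""
    with urllib.request.urlopen(url) as resp, tempfile.TemporaryFile() as tmp:
        # Copy in chunks so the whole archive is never held in memory.
        shutil.copyfileobj(resp, tmp)
        tmp.seek(0)
        return read_csv_from_zip(tmp, filename, **read_csv_kwargs)
```

With this sketch, a call such as `read_csv_from_zip_url(url, "NSPCL_AUG19_UK_LU.csv")` would download the archive once per run. Whether that is acceptable depends on how often the application starts, so caching the extracted CSV on the instance may be preferable.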
Another approach I considered was compressing just the individual CSV I need before including it in the repository, but this seems to generate superfluous files:
('Multiple files found in compressed zip file %s', "['NSPCL_AUG19_UK_LU.csv', '__MACOSX/', '__MACOSX/._NSPCL_AUG19_UK_LU.csv']")
so I have the same problem: I cannot point directly to the file I need.
Please let me know what the best practice is and how to get the file I need into a Pandas dataframe.
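The `__MACOSX/` entries in that message are metadata added by the macOS archiver. One way to avoid them, sketched below under the assumption that the CSV has already been extracted locally, is to build the archive with Python's `zipfile` module so that it holds exactly one member. The function name `zip_single_file` is hypothetical.

```python
import zipfile
from pathlib import Path


def zip_single_file(csv_path, zip_path):
    """Write `csv_path` into a new zip archive as its only member."""
    csv_path = Path(csv_path)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # arcname drops the directory part, so the member is just the file name.
        zf.write(csv_path, arcname=csv_path.name)
    return zip_path
```

An archive produced this way contains a single data file, which is what `pd.read_csv` expects when it is given a path ending in `.zip`.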