
This is a link to download a zip file containing about 1GB of postcode-level data published by the UK government's Office for National Statistics: https://www.arcgis.com/sharing/rest/content/items/19fac93960554b5e90840505bd73917f/data

Information on the data can be found here: http://geoportal.statistics.gov.uk/datasets/19fac93960554b5e90840505bd73917f

I have used this data in a data science application in Python, loading it into a Pandas dataframe. I have integrated this into a simple web page and am deploying it to the cloud. I do not want to include the large data file in my repository, which I access from an AWS EC2 instance. As I understand it, that leaves me two options:

1) Include the zipped file in the repository and read the CSV into a Pandas dataframe.

2) Open the URL, stream the file in, extract it in the script, and then read the CSV into a Pandas dataframe.

The issue with both of these approaches is that the zip file contains contents other than the csv file I need, and I'm not sure how to access that file specifically.
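For reference, this is roughly how I have been inspecting the archive's contents so far (the local filename below is just an example):

import zipfile

# List every member of the downloaded archive to locate the csv I need
with zipfile.ZipFile('NSPL_AUG_2019_UK.zip') as archive:
    print(archive.namelist())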

Another approach I considered was compressing just the individual csv I need before including it in the repository, but this seems to generate superfluous files:

('Multiple files found in compressed zip file %s', "['NSPCL_AUG19_UK_LU.csv', '__MACOSX/', '__MACOSX/._NSPCL_AUG19_UK_LU.csv']") 

so I have the same issue with not being able to point directly to the file I need.
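(I suspect that building the zip from Python itself, rather than with the macOS Finder, would avoid the __MACOSX entries, something along these lines, though I'm not sure it is the right approach:)

import zipfile
import pandas as pd

# Re-zip only the csv, without the Finder's __MACOSX metadata entries
with zipfile.ZipFile('NSPCL_AUG19_UK_LU.zip', 'w', zipfile.ZIP_DEFLATED) as zf:
    zf.write('NSPCL_AUG19_UK_LU.csv')

# A single-member archive can then be read directly
df = pd.read_csv('NSPCL_AUG19_UK_LU.zip', compression='zip')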

Please let me know what best practice is and how to get the file I need into a Pandas dataframe.

user3058703
  • Have you read [this](https://stackoverflow.com/questions/25962114/how-to-read-a-6-gb-csv-file-with-pandas) discussion? – tbhaxor Nov 15 '19 at 19:10
  • @GurkiratSingh OP's problem may not be how to read the csv file in, but how to specifically get the desired csv file from a zip file without costing much disk space in the AWS EC2 instance. – ytu Nov 15 '19 at 19:17
  • About compressing the individual csv, what tool did you use to compress it that created the superfluous files? – iliar Nov 15 '19 at 22:17
  • @ytu I think the problem is that he does not know how to read the csv when the zip file contains multiple csv files plus operating system metadata (OSX). His option 2) shouldn't cost that much disk, since he can delete the archive after loading. – hunzter Nov 16 '19 at 10:39

2 Answers


You could do the following to extract the main csv file and load it directly into a data frame:

from urllib.request import urlopen
import zipfile, io
import pandas as pd

url = 'https://www.arcgis.com/sharing/rest/content/items/19fac93960554b5e90840505bd73917f/data'

archive = zipfile.ZipFile(io.BytesIO(urlopen(url).read())) # Takes some time

csv_path = 'Data/NSPL_AUG_2019_UK.csv' # The desired csv file in the archive

df = pd.read_csv(io.BytesIO(archive.read(csv_path))) # Takes some time
>>> df
             pcd      pcd2     pcds  ...   imd     calncv        stp
0        AB1 0AA  AB1  0AA  AB1 0AA  ...  6808  S99999999  S99999999
1        AB1 0AB  AB1  0AB  AB1 0AB  ...  6808  S99999999  S99999999
2        AB1 0AD  AB1  0AD  AB1 0AD  ...  6808  S99999999  S99999999
3        AB1 0AE  AB1  0AE  AB1 0AE  ...  5503  S99999999  S99999999
4        AB1 0AF  AB1  0AF  AB1 0AF  ...  6668  S99999999  S99999999
...          ...       ...      ...  ...   ...        ...        ...
2632799  ZE3 9JW  ZE3  9JW  ZE3 9JW  ...  4187  S99999999  S99999999
2632800  ZE3 9JX  ZE3  9JX  ZE3 9JX  ...  4187  S99999999  S99999999
2632801  ZE3 9JY  ZE3  9JY  ZE3 9JY  ...  4187  S99999999  S99999999
2632802  ZE3 9JZ  ZE3  9JZ  ZE3 9JZ  ...  4187  S99999999  S99999999
2632803  ZE3 9XP  ZE3  9XP  ZE3 9XP  ...  4187  S99999999  S99999999

[2632804 rows x 41 columns]
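If holding the whole download in memory on the EC2 instance is a concern, the same idea should also work by streaming the zip to a temporary file first (a rough sketch, not timed on the full archive):

import shutil, tempfile, zipfile
from urllib.request import urlopen
import pandas as pd

url = 'https://www.arcgis.com/sharing/rest/content/items/19fac93960554b5e90840505bd73917f/data'
csv_path = 'Data/NSPL_AUG_2019_UK.csv'

with tempfile.TemporaryFile() as tmp:
    with urlopen(url) as response:
        shutil.copyfileobj(response, tmp)  # stream the download to a temp file instead of RAM
    tmp.seek(0)
    with zipfile.ZipFile(tmp) as archive:
        with archive.open(csv_path) as member:
            df = pd.read_csv(member)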

Now you could store the data frame on its own in a compressed file, as suggested by iliar:

df.to_pickle('NSPL_AUG_2019_UK.pkl', compression='gzip') # Takes some more time

When I tried this, the resulting file size was around 69MB, so you shouldn't have any trouble hosting it in the cloud.

To access the data frame again, simply load the pickled file:

>>> df = pd.read_pickle('NSPL_AUG_2019_UK.pkl', compression='gzip')
>>> df
             pcd      pcd2     pcds  ...   imd     calncv        stp
0        AB1 0AA  AB1  0AA  AB1 0AA  ...  6808  S99999999  S99999999
1        AB1 0AB  AB1  0AB  AB1 0AB  ...  6808  S99999999  S99999999
2        AB1 0AD  AB1  0AD  AB1 0AD  ...  6808  S99999999  S99999999
3        AB1 0AE  AB1  0AE  AB1 0AE  ...  5503  S99999999  S99999999
4        AB1 0AF  AB1  0AF  AB1 0AF  ...  6668  S99999999  S99999999
...          ...       ...      ...  ...   ...        ...        ...
2632799  ZE3 9JW  ZE3  9JW  ZE3 9JW  ...  4187  S99999999  S99999999
2632800  ZE3 9JX  ZE3  9JX  ZE3 9JX  ...  4187  S99999999  S99999999
2632801  ZE3 9JY  ZE3  9JY  ZE3 9JY  ...  4187  S99999999  S99999999
2632802  ZE3 9JZ  ZE3  9JZ  ZE3 9JZ  ...  4187  S99999999  S99999999
2632803  ZE3 9XP  ZE3  9XP  ZE3 9XP  ...  4187  S99999999  S99999999

[2632804 rows x 41 columns]
Seb
  • Thank you so much for this thorough answer. I had come across the term pickling before but had no idea what it was – user3058703 Nov 17 '19 at 14:53

I'm not sure I understand your question, but once you've loaded the needed csv into a pandas data frame, you can save that data frame in a compressed file using pandas.DataFrame.to_pickle:

df.to_pickle('df1.gz', compression='gzip')

And then when loading it

df = pandas.read_pickle('df1.gz', compression='gzip')

If you don't want to keep the gz file on your hard drive, you can store it somewhere online, then use urllib to download it, use gzip to decompress it and finally unpickle it:

from urllib.request import urlopen
import gzip
import pandas  # pandas must be installed for the data frame to unpickle
import pickle

# Download the gzipped pickle, decompress it in memory and unpickle the data frame
df = pickle.loads(gzip.decompress(urlopen('http://someUrl.com/file.gz').read()))
iliar