
I want to download a dataset from the UCI repository.

The dataset is in the tar.Z format, and ideally I'd like to read it in as a pandas data frame.

I've checked out "uncompressing tar.Z file with python?", which suggested the gzip library, so following https://docs.python.org/3/library/gzip.html I tried the code below, but I got an error message.

Thanks for any help!

import gzip
with gzip.open('https://archive.ics.uci.edu/ml/machine-learning-databases/diabetes/diabetes-data.tar.Z', 'rb') as f:
    file_content = f.read()

ERROR MESSAGE:
OSError: [Errno 22] Invalid argument: 'https://archive.ics.uci.edu/ml/machine-learning-databases/diabetes/diabetes-data.tar.Z'
Robbie
  • `gzip.open` expects a filename and not a URL. If you must do this within Python see [download large file in python with requests](https://stackoverflow.com/questions/16694907/download-large-file-in-python-with-requests). You'd then need to pass the filename of the downloaded file to `gzip.open`. If it's a one-time download you may wish to save this uncompressed, then have your application load the uncompressed data when it runs, to avoid the overhead of decompressing the file on each run. – v25 Aug 04 '20 at 11:21
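
The download step described in the comment above could look roughly like the following. This is a minimal standard-library sketch, not code from the original thread: the helper name `download` and the local filename are assumptions. It only covers fetching the file; Python's `gzip` module reads the gzip format and does not decode the LZW format used by `.Z` files, so the saved archive still needs a separate decompression step.

```python
import shutil
import urllib.request
from pathlib import Path


def download(url: str, dest: str) -> Path:
    """Stream the resource at `url` into the local file `dest` and return its path."""
    dest_path = Path(dest)
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        # Copy in chunks so a large archive is never held in memory all at once.
        shutil.copyfileobj(response, out)
    return dest_path
```

Calling `download(url, "diabetes-data.tar.Z")` with the UCI URL from the question would leave a local copy of the archive that later steps can work on.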

1 Answer


I do not think that the Python standard library can read the .Z (Unix compress) format directly; you could browse PyPI to see whether a third-party module handles the .Z extension. You can, however, use the command line to process the data:

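import subprocess
from io import StringIO

import pandas as pd

# Download the archive and stream it through the pipeline without saving it.
# gzip -dc decompresses the .Z (compress) data to stdout, and "tar -f -" reads
# the resulting tar stream from stdin (passing a filename to tar only works if
# the archive has already been downloaded). -O writes the extracted files to
# stdout; --wildcards is a GNU tar option.
data = subprocess.run(
    """curl -s https://archive.ics.uci.edu/ml/machine-learning-databases/diabetes/diabetes-data.tar.Z |
    gzip -dc |
    tar -xOf - --wildcards 'Diabetes-Data/data-*' """,
    shell=True,
    capture_output=True,
    text=True,
).stdout


df = pd.read_csv(StringIO(data), sep="\t", header=None)

df.head()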

            0      1   2    3
0  04-21-1991   9:09  58  100
1  04-21-1991   9:09  33  009
2  04-21-1991   9:09  34  013
3  04-21-1991  17:08  62  119
4  04-21-1991  17:08  33  007

You can read this ebook for more on command line options.
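
If you would rather keep the tar handling in Python, only the decompression needs an external tool. The sketch below rests on several assumptions: the archive has already been saved locally, a `gzip` binary is on the PATH (gzip can decompress `.Z` files), and the data files sit under `Diabetes-Data/data-*` as in the command above. The function names `decompress_z` and `read_members` are hypothetical.

```python
import fnmatch
import io
import subprocess
import tarfile


def decompress_z(path: str) -> bytes:
    """Return the decompressed bytes of a .Z file using the external gzip tool."""
    # Assumption: gzip is installed; -d decompresses and -c writes to stdout.
    result = subprocess.run(["gzip", "-dc", path], check=True, capture_output=True)
    return result.stdout


def read_members(tar_bytes: bytes, pattern: str) -> list:
    """Collect tab-separated rows from every archive member whose name matches `pattern`."""
    rows = []
    # mode="r:" opens an uncompressed tar stream.
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as archive:
        for member in archive.getmembers():
            if not (member.isfile() and fnmatch.fnmatchcase(member.name, pattern)):
                continue
            text = archive.extractfile(member).read().decode("utf-8", errors="replace")
            # Skip blank lines and split each remaining line on tabs.
            rows.extend(line.split("\t") for line in text.splitlines() if line.strip())
    return rows
```

Usage would be along the lines of `rows = read_members(decompress_z("diabetes-data.tar.Z"), "Diabetes-Data/data-*")`, after which `pd.DataFrame(rows)` should give a frame of string columns comparable to the one shown above.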

sammywemmy
  • I get an error message when I run your code: `EmptyDataError: No columns to parse from file` – Robbie Aug 04 '20 at 13:28
  • No idea why you are getting that error. I ran the code on my PC and was able to generate the data. Please confirm that you typed in the code correctly; if you are on Jupyter, clear the outputs and restart the kernel to see if that fixes it. Alternatively, you can download the file directly, extract it to your desktop, and then read the data in with pandas. – sammywemmy Aug 04 '20 at 13:31