
I am using numpy.fromfile to construct an array that I can pass to the pandas.DataFrame constructor:

import numpy as np
import pandas as pd

def read_best_file(file, **kwargs):
    '''
    Loads best price data into a dataframe
    '''
    names   = [ 'time', 'bid_size', 'bid_price', 'ask_size', 'ask_price' ]
    formats = [ 'u8',   'i4',       'f8',        'i4',       'f8'        ]
    offsets = [  0,      8,          12,          20,         24         ]

    dt = np.dtype({
            'names': names, 
            'formats': formats,
            'offsets': offsets 
        })
    return pd.DataFrame(np.fromfile(file, dt))

I would like to extend this method to work with gzipped files.

According to the numpy.fromfile documentation, the first parameter is file:

file : file or str
Open file object or filename

As such, I added the following to check for a gzip file path:

if isinstance(file, str) and file.endswith(".gz"):
    file = gzip.open(file, "r")

However, when I try to pass this to numpy.fromfile, I get an IOError:

IOError: first argument must be an open file

Question:

How can I call numpy.fromfile with a gzipped file?

Edit:

As requested in the comments, here is the implementation that checks for gzipped files:

def read_best_file(file, **kwargs):
    '''
    Loads best price data into a dataframe
    '''
    names   = [ 'time', 'bid_size', 'bid_price', 'ask_size', 'ask_price' ]
    formats = [ 'u8',   'i4',       'f8',        'i4',       'f8'        ]
    offsets = [  0,      8,          12,          20,         24         ]

    dt = np.dtype({
            'names': names, 
            'formats': formats,
            'offsets': offsets 
        })

    if isinstance(file, str) and file.endswith(".gz"):
        file = gzip.open(file, "r")

    return pd.DataFrame(np.fromfile(file, dt))
Steve Lorimer
  • We would need to see exactly how the check is implemented. – TheBlackCat Jun 27 '16 at 18:30
  • @TheBlackCat Literally before the return statement those 2 lines are inserted. – Steve Lorimer Jun 27 '16 at 18:34
  • Can you please show the complete code, with correct indentation? – TheBlackCat Jun 27 '16 at 18:35
  • @TheBlackCat what do you mean correct indentation - the indentation is correct – Steve Lorimer Jun 27 '16 at 18:36
  • Can you please edit your question to show the complete code with the change you have made. – TheBlackCat Jun 27 '16 at 18:36
  • @TheBlackCat I have! – Steve Lorimer Jun 27 '16 at 18:36
  • Try adding a dummy `print` function/statement inside the `if` test to make sure you are actually opening the file with `gzip`. – TheBlackCat Jun 27 '16 at 18:38
  • @TheBlackCat yes, I am opening the file with gzip - the if statement works, opening the file with gzip works... the problem is not there, it is with the fact that `numpy.fromfile` doesn't consider a `gzip file` an *open file object* – Steve Lorimer Jun 27 '16 at 18:40
  • Isn't there a `gzip` decompress method or option? – hpaulj Jun 27 '16 at 18:41
  • @hpaulj I have subsequently used `pd.DataFrame(np.fromstring(file.read(), dt))` which works. It does seem wasteful though as the `file.read()` will create a huge string and then `np.fromstring()` will create the array from the string. It would surely be more efficient to have `np.fromfile()` know how to read from a gzipped stream? – Steve Lorimer Jun 27 '16 at 18:45
  • For your purposes, maybe. But `fromfile` isn't billed as a general purpose file loader. It's a complement to the `tofile` method, and written in `c`. Look at `savez` and `load` if you want to work with compressed storage (they use `zip` archives). – hpaulj Jun 27 '16 at 19:23
  • http://stackoverflow.com/questions/12571913/python-unzipping-stream-of-bytes suggests `zlib` to decompress a stream, but I haven't read enough to see how that could be used in this context. – hpaulj Jun 27 '16 at 19:33
  • @hpaulj ok thanks - will look into the compressed versions – Steve Lorimer Jun 27 '16 at 19:46

2 Answers


I have had success reading arrays of raw binary data from gzipped files by feeding the read() results through numpy.frombuffer(). This code works in Python 3.7.3, and perhaps in earlier versions also.

# Example: read short integers (signed) from gzipped raw binary file

import gzip
import numpy as np

fname_gzipped = 'my_binary_data.dat.gz'
raw_dtype = np.int16
with gzip.open(fname_gzipped, 'rb') as f:
    from_gzipped = np.frombuffer(f.read(), dtype=raw_dtype)

# Demonstrate equivalence with direct np.fromfile()
fname_raw = 'my_binary_data.dat'
from_raw = np.fromfile(fname_raw, dtype=raw_dtype)

# True
print('raw binary and gunzipped are the same: {}'.format(
    np.array_equiv(from_gzipped, from_raw)))

# False
wrong_dtype = np.uint8
binary_as_wrong_dtype = np.fromfile(fname_raw, dtype=wrong_dtype)
print('wrong dtype and gunzipped are the same: {}'.format(
    np.array_equiv(from_gzipped, binary_as_wrong_dtype)))
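
Applied to the structured record layout from the question, the same approach might look like the sketch below. The dtype is copied from the question; the function name read_best_records and the constant BEST_DTYPE are hypothetical names introduced only for illustration.

```python
import gzip

import numpy as np

# Record layout copied from the question.
BEST_DTYPE = np.dtype({
    'names':   ['time', 'bid_size', 'bid_price', 'ask_size', 'ask_price'],
    'formats': ['u8',   'i4',       'f8',        'i4',       'f8'],
    'offsets': [0,      8,          12,          20,         24],
})


def read_best_records(path):
    """Read best-price records from a plain or gzipped binary file.

    Hypothetical helper: returns a structured NumPy array.
    """
    if str(path).endswith('.gz'):
        with gzip.open(path, 'rb') as f:
            # f.read() holds the whole decompressed payload in memory.
            # frombuffer wraps those bytes without a second copy, so the
            # resulting array is read-only.
            return np.frombuffer(f.read(), dtype=BEST_DTYPE)
    # Uncompressed files can still go through np.fromfile directly.
    return np.fromfile(path, dtype=BEST_DTYPE)
```

The result can then be passed to pd.DataFrame(...) as in the question. If a writable array is needed, calling .copy() on the frombuffer result is one option, at the cost of the extra copy.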

Rudi

gzip.open() doesn't return a true file object. It's a duck-typed one: it walks like a duck and sounds like a duck, but isn't quite a duck as far as numpy is concerned. numpy is being strict here (since much of it is written in lower-level C code, it might require an actual file descriptor).

You can get the underlying file from the gzip.open() call, but that's just going to get you the compressed stream.

This is what I would do: I would use subprocess.Popen() to invoke zcat to uncompress the file as a stream.

>>> import subprocess
>>> p = subprocess.Popen(["/usr/bin/zcat", "foo.txt.gz"], stdout=subprocess.PIPE)
>>> type(p.stdout)
<type 'file'>
>>> p.stdout.read()
'hello world\n'

Now you can pass p.stdout as a file object to numpy:

np.fromfile(p.stdout, ...)
rrauenza
  • `fromfile` is doing its own file read in c code. It does not import or use the `gzip` module. – hpaulj Jun 27 '16 at 19:21
  • This doesn't work (for me), because the pipe that zcat's stdout is written to is not seekable. Therefore, np.fromfile raises `IOError: could not seek in file` – rodion Oct 21 '16 at 09:24
  • Ah, then you're going to have to use either a temporary file or Python's StringIO, if your file will fit into memory. gzip's lack of random access is discussed further here: http://stackoverflow.com/questions/25985645/about-the-use-of-seek-on-gzip-files – rrauenza Oct 21 '16 at 20:11
  • `BytesIO` gives me the same problem as `gzip.open` – Mark Nov 17 '16 at 15:23
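
The temporary-file route suggested in the comments could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name fromfile_gz and the temporary file name are hypothetical, and the approach trades extra disk I/O for giving np.fromfile a regular, seekable file.

```python
import gzip
import os
import shutil
import tempfile

import numpy as np


def fromfile_gz(path, dtype):
    """Decompress a gzipped file to a temporary file, then read it
    with np.fromfile (hypothetical helper)."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_path = os.path.join(tmp_dir, 'decompressed.bin')
        # Stream the decompressed bytes to disk in chunks, so the whole
        # payload is never held in a single Python bytes object.
        with gzip.open(path, 'rb') as src, open(tmp_path, 'wb') as dst:
            shutil.copyfileobj(src, dst)
        # np.fromfile now sees an ordinary file on disk.
        return np.fromfile(tmp_path, dtype=dtype)
```

The temporary directory and its contents are removed when the with block exits; the returned array lives in memory and is unaffected by that cleanup.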