I have data stored in either a collection of files or in a single compound file. The compound file is formed by concatenating all the separate files, and then preceding everything with a header that gives the offsets and sizes of the constituent parts. I'd like to have a file-like object that presents a view of the compound file, where the view represents just one of the member files. (That way, I can have functions for reading the data that accept either a real file object or a "view" object, and they needn't worry about how any particular dataset is stored.) What library will do this for me?

The mmap class looked promising since it's constructed from a file, a length, and an offset, which is exactly what I have, but the offset needs to be aligned with the underlying file system's allocation granularity, and the files I'm reading don't meet that requirement. The name of the MultiFile class fits the bill, but it's tailored for attachments in e-mail messages, and my files don't have that structure.

The file operations I'm most interested in are read, seek, and tell. The files I'm reading are binary, so the text-oriented functions like readline and next aren't so crucial. I might eventually also need write, but I'm willing to forgo that feature for now since I'm not sure how appending should behave.

Rob Kennedy
  • Can you just wrap a file object in a convenience class which has `read`, `seek` and `tell` methods which calculate the actual file position from the pseudo-position? – mgilson Jul 03 '12 at 15:01
  • Also, how big are the files? Are they small enough to fit comfortably in memory? If that's the case, you may be able to chunk them up using `StringIO` – mgilson Jul 03 '12 at 15:06
  • _"[..]but the offset needs to be aligned with the underlying file system's allocation granularity, and the files I'm reading don't meet that requirement."_ ... can you clarify this? – Burhan Khalid Jul 03 '12 at 15:09
  • How can this be useful if you have to read the file to know what are the offset/length of the chunks? Or are they at specific positions, and you `seek` there and `read(1)`? – jadkik94 Jul 03 '12 at 15:14
  • @Burhan, with `mmap`, I could map a view of the compound file and pass that mapped view around as a file itself, but `mmap` requires that each mapped chunk of file start at a multiple of the allocation granularity, say, 4 KB. My sub-files can start at any offset within the compound file, so `mmap` won't work. – Rob Kennedy Jul 03 '12 at 15:18
  • @Jadkik, I only have to read the *header* of the file to know the offsets and sizes of the other components. I don't need to read the entire file. Using that, I could create a view like this: `fv = FileView(file, offset, length)`. – Rob Kennedy Jul 03 '12 at 15:23
  • @Mgilson, if a simple wrapper is all it takes, then I guess I can write it myself. The files could be multiple gigabytes, so I'd prefer not to load the whole thing, or even a sub-file, into memory. – Rob Kennedy Jul 03 '12 at 15:24
  • @RobKennedy Sorry, you said `preceding everything with a header`, I missed the "everything" part, I thought each chunk was preceded by a header. – jadkik94 Jul 03 '12 at 15:49
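On the `mmap` alignment problem raised in the question and comments: one possible workaround is to round the offset down to a multiple of `mmap.ALLOCATIONGRANULARITY`, map a slightly larger region, and skip the extra leading bytes. The sketch below is only an illustration under that assumption; `map_subfile` is a hypothetical helper name, and the result is a sliceable `memoryview`, not a file object with read, seek, and tell.

```python
import mmap
import tempfile

def map_subfile(f, offset, length):
    """Map `length` bytes at an arbitrary `offset` of the open file `f`.

    Hypothetical helper: returns (mapping, view). Release the view before
    closing the mapping.
    """
    gran = mmap.ALLOCATIONGRANULARITY
    aligned = offset - offset % gran   # nearest allowed offset at or below `offset`
    skip = offset - aligned            # extra leading bytes that get mapped
    mapping = mmap.mmap(f.fileno(), skip + length,
                        access=mmap.ACCESS_READ, offset=aligned)
    return mapping, memoryview(mapping)[skip:skip + length]

# Small demonstration on a temporary file with a member at offset 5000.
with tempfile.TemporaryFile() as f:
    f.write(b"x" * 5000 + b"payload" + b"y" * 100)
    f.flush()
    mapping, view = map_subfile(f, 5000, 7)
    data = bytes(view)
    view.release()    # must happen before mapping.close()
    mapping.close()
```

This does not work for a zero-length member, because a mapping length of 0 means "map the whole file", and very large members may not fit in the address space of a 32-bit process.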

2 Answers

I know you were searching for a library, but as soon as I read this question I thought I'd write my own. So here it is:

import os

class View:
    def __init__(self, f, offset, length):
        self.f = f
        self.f_offset = offset
        self.offset = 0
        self.length = length

    def seek(self, offset, whence=os.SEEK_SET):
        if whence == os.SEEK_SET:
            self.offset = offset
        elif whence == os.SEEK_CUR:
            self.offset += offset
        elif whence == os.SEEK_END:
            self.offset = self.length+offset
        else:
            # Reject anything else instead of letting it move the underlying file.
            raise ValueError("invalid whence: %r" % (whence,))
        self.f.seek(self.offset+self.f_offset, os.SEEK_SET)
        # Report the position within the view, not within the underlying file.
        return self.offset

    def tell(self):
        return self.offset

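    def read(self, size=-1):
        self.seek(self.offset)
        remaining = max(0, self.length-self.offset)
        if size is None or size<0 or size>remaining:
            size = remaining
        data = self.f.read(size)
        # Advance by what was actually read, in case the file is shorter than expected.
        self.offset += len(data)
        return data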

if __name__ == "__main__":
    import collections

    # Binary mode, so that seek offsets are plain byte positions.
    with open('test.txt', 'rb') as f:
        views = []
        offsets = [i*11 for i in range(10)]

        # Each record is ".<length><data>": the length digit is at o+1, the data at o+2.
        for o in offsets:
            f.seek(o+1)
            length = int(f.read(1))
            views.append(View(f, o+2, length))

        f.seek(0)

        # Read each view in one go...
        completes = {}
        for v in views:
            completes[v.f_offset] = v.read()
            v.seek(0)

        # ...then again in interleaved 3-byte chunks.
        strs = collections.defaultdict(bytes)
        for i in range(3):
            for v in views:
                strs[v.f_offset] += v.read(3)
        strs = dict(strs) # We want it to raise KeyErrors after that.

        for offset, s in completes.items():
            print(offset, strs[offset], completes[offset])
            assert strs[offset] == completes[offset], "Something went wrong!"

And I wrote another script to generate the "test.txt" file:

import string, random

# Write ten 11-character records of the form ".9<nine random letters>".
with open('test.txt', 'w') as f:
    for i in range(10):
        rand_list = list(string.ascii_letters)
        random.shuffle(rand_list)
        rand_str = "".join(rand_list[:9])
        f.write(".%d%s" % (len(rand_str), rand_str))

It worked for me. The files I tested on are not binary files like yours, and they're not as big as yours, but this might be useful, I hope. If not, then thank you, that was a good challenge :D
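For binary data closer to what the question describes, the offsets and sizes would come from the header rather than from a digit in front of each record. Here is a rough sketch of that step. The header layout (a 32-bit count followed by 64-bit offset/size pairs) and the names `build_compound` and `read_table` are pure assumptions for illustration, not the real file format.

```python
import io
import struct

# Assumed layout (hypothetical, not the asker's real format):
#   uint32 count, then `count` pairs of (uint64 offset, uint64 size),
#   all little-endian, with offsets measured from the start of the file.
COUNT = struct.Struct("<I")
ENTRY = struct.Struct("<QQ")

def build_compound(parts):
    """Concatenate byte strings behind a header describing each part."""
    header_size = COUNT.size + ENTRY.size * len(parts)
    header = [COUNT.pack(len(parts))]
    offset = header_size
    for part in parts:
        header.append(ENTRY.pack(offset, len(part)))
        offset += len(part)
    return b"".join(header) + b"".join(parts)

def read_table(f):
    """Return [(offset, size), ...] by reading only the header of `f`."""
    f.seek(0)
    (count,) = COUNT.unpack(f.read(COUNT.size))
    return [ENTRY.unpack(f.read(ENTRY.size)) for _ in range(count)]

# Build a small compound file in memory and list its members.
f = io.BytesIO(build_compound([b"\x00\x01\x02", b"second part", b""]))
table = read_table(f)
for offset, size in table:
    f.seek(offset)
    print(offset, size, f.read(size))
```

Each `(offset, size)` pair from the table could then be handed to `View(f, offset, size)`, so only the header is read up front.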

Also, I was wondering, if these are actually multiple files, why not use some kind of an archive file format, and use their libraries to read them?

Hope it helps.

Rob Kennedy
jadkik94
  • Thanks. This was helpful. It might be nice to use a better-defined compound-file format, but our product has been producing files like this for almost a decade, so it's too late to change now. I have to write code to handle what the files *are*, not for how I *wish* they were. – Rob Kennedy Jul 12 '12 at 19:57

Depending on how complicated you need this to be, something like this should work -- I've left off some of the details since I don't know how closely you need to emulate a file object (e.g., will you ever use obj.read(), or will you always use obj.read(nbytes)?):

class FileView(object):
    def __init__(self, file, offset, length):
        self._file = file
        self._offset = offset
        self._length = length

    def seek(self, pos):
        # May need to get a little fancier here to support the second argument to seek.
        return self._file.seek(self._offset + pos)

    def tell(self):
        return self._file.tell() - self._offset

    def read(self, *args):
        # May need to get a little more complicated here to make sure that the number of
        # bytes read is smaller than the number of bytes available for this file
        return self._file.read(*args)
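The two details left open in the comments above (the second argument to seek, and keeping reads inside the sub-file) might be filled in roughly as follows. This is only a sketch that keeps the same design, where the position comes from the underlying file. `BoundedFileView` is a hypothetical name, and raising `ValueError` for bad arguments is an assumption rather than a requirement.

```python
import io
import os

class BoundedFileView(object):
    """Hypothetical read-only window of `length` bytes at `offset` in `file`."""

    def __init__(self, file, offset, length):
        self._file = file
        self._offset = offset
        self._length = length

    def tell(self):
        # Position relative to the start of the view.
        return self._file.tell() - self._offset

    def seek(self, pos, whence=os.SEEK_SET):
        if whence == os.SEEK_SET:
            target = pos
        elif whence == os.SEEK_CUR:
            target = self.tell() + pos
        elif whence == os.SEEK_END:
            target = self._length + pos
        else:
            raise ValueError("unsupported whence: %r" % (whence,))
        if target < 0:
            raise ValueError("negative seek position: %r" % (target,))
        self._file.seek(self._offset + target)
        return target

    def read(self, size=-1):
        pos = self.tell()
        if pos < 0:
            # The underlying file sits before the view; refuse to leak other data.
            raise ValueError("call seek() on the view before reading")
        remaining = max(0, self._length - pos)
        if size is None or size < 0 or size > remaining:
            size = remaining
        return self._file.read(size)

# Example: a 6-byte member surrounded by unrelated data.
f = io.BytesIO(b"HDR" + b"abcdef" + b"XYZ")
v = BoundedFileView(f, 3, 6)
v.seek(0)
first = v.read(4)   # stops inside the member
rest = v.read()     # stops at the end of the member, not of the file
```

Because the position lives in the underlying file, two views over the same file object will disturb each other unless each one calls seek before reading.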
mgilson