
I have a very large number (>1000) of files, each about 20MB, which represent continuous time-series data saved in a simple binary format such that if I concatenate them all directly, I recover my full time series.

I would like to do this virtually in Python, using `np.memmap` to address each file and then concatenating them all on the fly into one big memmap.

Searching around SO suggests that `np.concatenate` will load them into memory, which I can't do. The question here seems to answer it in part, but the answer there assumes that I know how big my files are before concatenation, which is not necessarily true.

So, is there a general way to concatenate memmaps without knowing beforehand how big they are?

EDIT: it was pointed out that the linked question actually creates a concatenated file on disk. This is not something I want.

KBriggs
  • Are these files created with `np.save`? That format writes a header block, followed by the bytes of the array's data buffer. The size of the data buffer is calculated from information in the header - the dtype and shape. – hpaulj Mar 04 '16 at 18:32
  • No, these files are created using a completely different software suite. There is no header. Just raw binary data. I know for a fact that I could physically concatenate these files on disk without reading their contents at all, and achieve what I want to do. But I would like to avoid that if there is a way. – KBriggs Mar 04 '16 at 19:42
  • But isn't that what the linked SO is doing - copying data from memmap to another larger one - i.e. copying 'physically' on the disk? I didn't see anything about creating soft links between different maps. – hpaulj Mar 04 '16 at 20:00
  • You're right, that's what is happening. I had not read carefully enough. Any suggestions for me? I edited. – KBriggs Mar 04 '16 at 20:02
  • @KBriggs were you able to do it virtually without actually concatenating it physically on the disk? – Banks Sep 06 '19 at 03:36
  • @Shub no, unfortunately. You can simulate the behavior by just memmapping all the files individually and writing a wrapper to convert a particular time point into a (file ID, file offset) tuple, though. Adds a small amount of overhead but not too bad. – KBriggs Sep 06 '19 at 13:52
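
The wrapper described in the last comment could look roughly like the sketch below. It is only an illustration under stated assumptions: the class name `VirtualConcat` and its methods `locate` and `read` are hypothetical, the sample `dtype` is something you have to supply (the thread never says what it is), and every file is assumed to be non-empty and a whole number of samples long. Because `np.memmap` infers the array length from the file size when no `shape` is given, the file sizes do not need to be known in advance.

```python
import bisect

import numpy as np


class VirtualConcat:
    """Read-only view of several raw binary files as one 1-D time series.

    Hypothetical helper for illustration; not part of NumPy.
    """

    def __init__(self, paths, dtype):
        # dtype is an assumption: it must match how the files were written.
        self.dtype = np.dtype(dtype)
        # With shape omitted, np.memmap derives the length from the file size.
        self.maps = [np.memmap(p, dtype=self.dtype, mode="r") for p in paths]
        # starts[i] is the global index of the first sample of file i;
        # the final entry is the total number of samples.
        self.starts = [0]
        for m in self.maps:
            self.starts.append(self.starts[-1] + len(m))

    def __len__(self):
        return self.starts[-1]

    def locate(self, i):
        """Convert a global sample index into a (file ID, file offset) tuple."""
        if not 0 <= i < len(self):
            raise IndexError(i)
        file_id = bisect.bisect_right(self.starts, i) - 1
        return file_id, i - self.starts[file_id]

    def read(self, start, stop):
        """Copy samples [start, stop) into memory, crossing file boundaries."""
        start = max(0, start)
        stop = min(stop, len(self))
        out = np.empty(max(0, stop - start), dtype=self.dtype)
        pos = start
        while pos < stop:
            file_id, offset = self.locate(pos)
            # Take as much as this file can provide, then move to the next one.
            n = min(stop - pos, len(self.maps[file_id]) - offset)
            out[pos - start:pos - start + n] = self.maps[file_id][offset:offset + n]
            pos += n
        return out
```

Only the slice requested through `read` is copied into memory; nothing is written to disk. One caveat: this version maps every file up front, which with more than 1000 files may run into per-process limits on open files or mappings on some systems, so opening the maps lazily inside `read` might be the safer variant.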

0 Answers