0

I would like to use boost::algorithm::knuth_morris_pratt over some huge files (serveral hundred gigabytes). This means I can't just read the whole file into memory nor mmap it, I need to read it in chunks.

knuth_morris_pratt operates on an iterator, so I guess it is possible to make it read input data "lazily" (on-demand), it would be a matter of writing a "lazy" iterator for some file access class like ifstream, or better istream.

I would like to know if there is some adapter available (already written) that adapts istream to Boost's knuth_morris_pratt so that it won't read all file data all at once?

I know there is a boost::spirit::istream_iterator, but it lacks some some methods (like operator+), so it would have to be modified to work.

On StackOverflow there's a implementation of bidirectional_iterator here, but it still requires some work before it can be used with knuth_morris_pratt.

Are there any istream iterators that are already written, tested and working?

Update: I can't do mmap, because my software should work on multiple operating systems, both on 32-bit and 64-bit architectures. Also very often I don't have the files anyway, they're being generated on-the-fly, that's why I search for a solution that involves streams.

Community
  • 1
  • 1
antonone
  • 2,045
  • 1
  • 25
  • 35
  • I can't help you with your original question, sorry. But are you sure that you can't `mmap` it? I successfully mapped files of about 80GB on a machine with 8GB RAM before. – anderas Jul 24 '15 at 09:10
  • I'm trying to figure out a way to search through huge binary disk images for my software, so I would like it to support multiple terabyte images as well, on 32-bit architecture :). Also, my input stream is sometimes generated dynamically by concatenating other huge files (something like RAID does), so often I don't have physical files anyway. – antonone Jul 24 '15 at 09:11
  • 1
    It is quite stupid that KMP from boost uses random access iterator. If you want to get indices of all occurences, it is very easy to write an implementation that uses only pointer increment, nothing else. Perhaps you could take working KMP algorithm from some other source and adapt it to your needs, instead of implementing random access iterator for large files. – stgatilov Jul 24 '15 at 10:24

1 Answers1

0

You should simply memory map it.

In practice, 64-bit processors usually have 48-bit address space which is enough for 256 terabytes of memory.

Last I checked, Linux allows 128TB of virtual address space per process on x86-64

(from https://superuser.com/a/168121/75721)

Spirit's istream_iterator is actually it's multi_pass_adaptor and it has different design goals. Unless you have a way to flush the buffer, it will grow indefinitely (by default allocating a deque buffer).

Community
  • 1
  • 1
sehe
  • 374,641
  • 47
  • 450
  • 633
  • Thanks for the feedback. I can't use `mmap` though. I've updated the question with some rationale why I can't use it. – antonone Jul 24 '15 at 09:24