My question is similar to this, but I have not found any C++ references for this problem.

There is a list of big files to read and process. What is the best way to create an input stream that would get data from the files one by one, opening the next file automatically upon the end of the previous file? This stream will be given to a processing function which sequentially reads blocks of variable size, across file boundaries.

xivaxy
  • Well, the "Unixy" way would be to write your program as a filter (i.e. it reads from stdin and writes to stdout), and then use existing building blocks like `cat input_file*.dat | myprogram`. But without more details (i.e. are the files all in one directory with names that are glob-able, or are they spread out in various places, or the order needs to be different), it's hard to say more than that... – twalberg Jul 29 '16 at 17:51
  • You could create a new class derived from `std::istream` that contains a `std::vector` of `std::ifstream` that automatically switches to the next on EOF or read failure – KABoissonneault Jul 29 '16 at 17:52
  • Gather them into a single buffer file first, then read that file afterwards? So a two-part operation. – Charlie Jul 29 '16 at 18:08

2 Answers


What you'll want to do is provide a type that inherits from std::basic_streambuf. It has many cryptic virtual member functions; the relevant ones for you are showmanyc(), underflow(), uflow(), and xsgetn(). You'll want to override them so that, on underflow, they automatically open the next file in your list (if any).

Here is a sample implementation. We act as a std::filebuf and just keep a deque<string> of the next files we need to read:

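#include <deque>
#include <fstream>
#include <initializer_list>
#include <string>

class multifilebuf : public std::filebuf
{
public:
    multifilebuf(std::initializer_list<std::string> filenames)
    : next_filenames(filenames)
    {
        // open the first file (if any); the rest stay queued
        if (!next_filenames.empty()) {
            open(next_filenames.front(), std::ios::in);
            next_filenames.pop_front();
        }
    }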

protected:
    std::streambuf::int_type underflow() override
    {   
        for (;;) {
            auto res = std::filebuf::underflow();
            if (res == traits_type::eof()) {
                // done with this file, move onto the next one
                if (next_filenames.empty()) {
                    // super done
                    return res;
                }
                else {
                    // onto the next file
                    close();
                    open(next_filenames.front(), std::ios::in);

                    next_filenames.pop_front();
                    continue;
                }
            }
            else {
                return res;
            }
        }
    }   

private:
    std::deque<std::string> next_filenames;
};

That way, you can make everything transparent to your end user:

multifilebuf mfb{"file1", "file2", "file3"};

std::istream is(&mfb);
std::string word;
while (is >> word) {
    // transparently read words from all the files
}
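
Since the question asks for variable-size block reads, here is a variant sketch that derives from std::streambuf directly and owns its buffer. Depending on the standard library, std::filebuf may serve large read() calls in its own xsgetn() without going through underflow(), so a block read may stop at a file boundary; with a plain std::streambuf base, the default xsgetn() refills through underflow(). The name concat_filebuf, the default buffer size, and the choice to silently skip files that fail to open are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <deque>
#include <fstream>
#include <istream>
#include <streambuf>
#include <string>
#include <utility>
#include <vector>

// Hypothetical name: a read-only streambuf that chains several files.
class concat_filebuf : public std::streambuf
{
public:
    explicit concat_filebuf(std::deque<std::string> filenames,
                            std::size_t buffer_size = 1 << 16)
    : pending_(std::move(filenames)),
      buffer_(buffer_size ? buffer_size : 1)
    {
        // Empty get area, so the first read calls underflow().
        char* end = buffer_.data() + buffer_.size();
        setg(end, end, end);
    }

protected:
    int_type underflow() override
    {
        if (gptr() < egptr()) {
            return traits_type::to_int_type(*gptr());
        }
        for (;;) {
            if (current_.is_open()) {
                current_.read(buffer_.data(),
                              static_cast<std::streamsize>(buffer_.size()));
                std::streamsize got = current_.gcount();
                if (got > 0) {
                    // Expose the freshly read bytes as the get area.
                    setg(buffer_.data(), buffer_.data(), buffer_.data() + got);
                    return traits_type::to_int_type(*gptr());
                }
                current_.close();  // this file is exhausted
            }
            if (pending_.empty()) {
                return traits_type::eof();  // no files left
            }
            // Assumption: a file that fails to open is simply skipped.
            current_.clear();
            current_.open(pending_.front(), std::ios::in | std::ios::binary);
            pending_.pop_front();
        }
    }

private:
    std::deque<std::string> pending_;  // files not yet opened
    std::vector<char> buffer_;         // backing store for the get area
    std::ifstream current_;            // file currently being read
};
```

Usage mirrors the example above: construct concat_filebuf buf({"file1", "file2", "file3"}), wrap it in std::istream is(&buf), then call is.read(block, n) and check is.gcount(); a block that straddles two files should come back in a single call.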
Barry
  • These things are going to be featured in the next questions I'll be asking to someone who claims to know everything about C++. Nice find! – KABoissonneault Jul 29 '16 at 18:11
  • @KABoissonneault Even went ahead and figured out how to make a working example. I guess this case isn't so bad, only needed `underflow()`. – Barry Jul 29 '16 at 18:44

For an easy solution, use Boost's join with ranges of istream iterators for the files. I am unaware of a similar function in the current C++ standard library, but one probably exists in the Ranges TS or in range-v3.

Writing join yourself is also perfectly possible.

I'd write it as a "flattening" input-only iterator -- an iterator over a range of ranges that iterates over the contents of each range in turn. The iterator would keep track of the future range of ranges, and an iterator for the current element.

Here is a very simple zip iterator to give you the idea of the magnitude of code you'd have to write (a zip iterator is a different concept, and this is a simple one only suitable for a for(:) loop).

This is a sketch of how you might do it using C++14:

#include <deque>
#include <utility>

template<class It>
struct range_t {
  It b{};
  It e{};
  It begin() const { return b; }
  It end() const { return e; }
  bool empty() const { return begin()==end(); }
};

template<class It>
struct range_of_range_t {
  std::deque<range_t<It>> ranges;
  It cur;
  friend bool operator==(range_of_range_t const& lhs, range_of_range_t const& rhs) {
    return lhs.cur==rhs.cur;
  }
  friend bool operator!=(range_of_range_t const& lhs, range_of_range_t const& rhs) {
    return !(lhs==rhs);
  }
  void operator++(){
    ++cur;
    if (ranges.front().end() == cur) {
      next_range();
    }
  }
  void next_range() {
    while(ranges.size() > 1) {
      ranges.pop_front();
      if (ranges.front().empty()) continue;
      cur = ranges.front().begin();
      break;
    }
  }
  decltype(auto) operator*() const {
    return *cur;
  }
  range_of_range_t( std::deque<range_t<It>> in ):
    ranges(std::move(in)),
    cur{}
  {
    // easy way to find the starting cur:
    ranges.push_front({});
    next_range();
  }
};

The iterator needs work, in that it should satisfy all of the iterator requirements. And getting the end iterator right is a bit of work.

This isn't a stream, but rather an iterator.

Yakk - Adam Nevraumont