3

I'm currently working on a small project which requires loading messages from a file. The messages are stored sequentially in the file and files can become huge, so loading the entire file content into memory is unrewarding.

Therefore we decided to implement a FileReader class that is capable of moving to specific elements in the file quickly and load them on request. Commonly used something along the following lines

SpecificMessage m;
FileReader fr;
fr.open("file.bin");
fr.moveTo(120); // Move to Message #120
fr.read(&m);    // Try deserializing as SpecificMessage 

The FileReader per se works great. Therefore we thought about adding STL compliant iterator support as well: A random access iterator that provides read-only references to specific messages. Used in the following way

for (auto iter = fr.begin<SpecificMessage>(); iter != fr.end<SpecificMessage>(); ++iter) {
  // ...
}

Remark: the above assumes that the file only contains messages of type SpecificMessage. We've been using boost::iterator_facade to simplify the implementation.

Now my question boils down to: how to implement the iterator correctly? Since FileReader does not actually hold a sequence of messages internally, but loads them on request.

What we've tried so far:

Storing the message as an iterator member

This approach stores the message in the iterator instance. Which works great for simple use-cases but fails for more complex uses. E.g. std::reverse_iterator has a dereference operation that looks like this

 reference operator*() const
 {  // return designated value
   _RanIt _Tmp = current;
   return (*--_Tmp);
 }

This breaks our approach as a reference to a message from a temporary iterator is returned.

Making the reference type equal the value type

@DDrmmr in the comments suggested making the reference type equal the value type, so that a copy of the internally stored object is returned. However, I think this is not valid for the reverse iterator which implements the -> operator as

pointer operator->() const {
  return (&**this);
}

which derefs itself, calls the *operator which then returns a copy of a temporary and finally returns the address of this temporary.

Storing the message externally

Alternatively I though about storing the message externally:

SpecificMessage m;
auto iter = fr.begin<SpecificMessage>(&m);
// ...

which also seems to be flawed for

auto iter2 = iter + 2

which will have both iter2 and iter point to the same content.

chrish.
  • 715
  • 5
  • 11
  • Promises are the concept we use to indicate that data will be promised but not necessarily currently available. Might I suggest adapting the iterator to return promises or a similar concept to lazily move around the file? – BlamKiwi Nov 09 '14 at 10:55
  • Why not just return the message by value? – D Drmmr Nov 09 '14 at 12:03
  • @MorphingDragon not sure what you mean. Do you have some references? – chrish. Nov 09 '14 at 12:34
  • @DDrmmr good idea. But I just discovered that the reverse iterator does the following pointer operator->() const {return (&**this);} which means it calls its own *operator generating a value copy and returning then the address to it. – chrish. Nov 09 '14 at 12:55
  • @ChristophHeindl Do you know the number of messages in advance ? – Oncaphillis Nov 09 '14 at 13:51
  • Yes I do, FileReader peeks at the file content when opening a file. – chrish. Nov 09 '14 at 15:09

4 Answers4

2

You are having issues because your iterator does not conform to the forward iterator requirements. Specifically:

  • *i must be an lvalue reference to value_type or const value_type ([forward.iterators]/1.3)
  • *i cannot be a reference to an object stored in the iterator itself, due to the requirement that two iterators are equal if and only if they are bound to the same object ([forward.iterators]/6)

Yes, these requirements are a huge pain in the butt, and yes, that means that things like std::vector<bool>::iterator are not random access iterators even though some standard library implementations incorrectly claim that they are.


EDIT: The following suggested solution is horribly broken, in that dereferencing a temporary iterator returns a reference to an object that may not live until the reference is used. For example, after auto& foo = *(i + 1); the object referenced by foo may have been released. The implementation of reverse_iterator referenced in the OP will cause the same problem.

I'd suggest that you split your design into two classes: FileCache that holds the file resources and a cache of loaded messages, and FileCache::iterator that holds a message number and lazily retrieves it from the FileCache when dereferenced. The implementation could be something as simple as storing a container of weak_ptr<Message> in FileCache and a shared_ptr<Message> in the iterator: Simple demo

Casey
  • 41,449
  • 7
  • 95
  • 125
  • that's what I'm suspecting. Thanks for the clarification of the requirements. Thanks also for the demo. I will accept this answer. One more side-question. – chrish. Nov 09 '14 at 17:05
  • One more note. In the example, wouldn't you think that only const iterators make sense? – chrish. Nov 09 '14 at 17:08
  • @ChristophHeindl Yes, although one could feasibly implement a custom deleter for the `shared_ptr`s that does write-back to the backing file. I'll leave that as an exercise for the reader ;) – Casey Nov 09 '14 at 17:21
  • @ChristophHeindl You should also note that there are thread-safety concerns with the usage of the `mutable` member, and that I specifically avoided `std::make_shared` so that the `weak_ptr` control blocks wouldn't keep the `Message` memory allocations hanging around. – Casey Nov 09 '14 at 17:23
  • @ChristophHeindl Bad news, I'm an idiot and my suggested approach is unworkable. – Casey Nov 09 '14 at 17:44
  • I'll keep thinking, but for now the best recommendation I have is to memory map the entire file, and keep an index (which could be constructed lazily) that maps message numbers to `Message*`s within the mapped file. – Casey Nov 09 '14 at 18:02
  • thanks for your efforts. Not sure if that helps, but I keep thinking to return a proxy_reference instead of a plain reference that still holds the shared pointer internally and a cast operator reveals the actual referene. – chrish. Nov 09 '14 at 18:23
2

As I hinted in my other answer, you could consider using memory mapped files. In the comment you asked:

As far as memory mapped files is concerned, this seems not what I want to have, as how would you provide an iterator over SpecificMessages for them?

Well, if your SpecificMessage is a POD type, you could just iterate over the raw memory directly. If not, you could have a deserialization helper (as you already have) and use Boost transform_iterator to do the deserialization on demand.

Note that we can make the memory mapped file managed, effectively meaning that you can just use it as a regular heap, and you can store all standard containers. This includes node-based containers (map<>, e.g.), dynamic-size containers (e.g. vector<>) in addition to the fixed-size containers (array<>) - and any combinations of those.

Here's a demo that takes a simple SpecificMessage that contains a string, and (de)derializes it directly into shared memory:

using blob_t       = shm::vector<uint8_t>;
using shared_blobs = shm::vector<blob_t>;

The part that interests you would be the consuming part:

bip::managed_mapped_file mmf(bip::open_only, DBASE_FNAME);
shared_blobs* table = mmf.find_or_construct<shared_blobs>("blob_table")(mmf.get_segment_manager());

using It = boost::transform_iterator<LazyLoader<SpecificMessage>, shared_blobs::const_reverse_iterator>;

// for fun, let's reverse the blobs
for (It first(table->rbegin()), last(table->rend()); first < last; first+=13)
    std::cout << "blob: '" << first->contents << "'\n";

// any kind of random access is okay, though:
auto random = rand() % table->size();
SpecificMessage msg;
load(table->at(random), msg);
std::cout << "Random blob #" << random << ": '" << msg.contents << "'\n";

So this prints each 13th message, in reverse order, followed by a random blob.

Full Demo

The sample online uses the lines of the sources as "messages".

Live On Coliru

#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/managed_mapped_file.hpp>
#include <boost/container/scoped_allocator.hpp>
#include <boost/interprocess/containers/vector.hpp>
#include <iostream>

#include <boost/iterator/transform_iterator.hpp>
#include <boost/range/iterator_range.hpp>

static char const* DBASE_FNAME = "database.map";

namespace bip = boost::interprocess;

namespace shm {
    using segment_manager = bip::managed_mapped_file::segment_manager;
    template <typename T> using allocator = boost::container::scoped_allocator_adaptor<bip::allocator<T, segment_manager> >;
    template <typename T> using vector    = bip::vector<T, allocator<T> >;
}

using blob_t       = shm::vector<uint8_t>;
using shared_blobs = shm::vector<blob_t>;

struct SpecificMessage {
    // for demonstration purposes, just a string; could be anything serialized
    std::string contents;

    // trivial save/load serialization code:
    template <typename Blob>
    friend bool save(Blob& blob, SpecificMessage const& msg) {
        blob.assign(msg.contents.begin(), msg.contents.end());
        return true;
    }

    template <typename Blob>
    friend bool load(Blob const& blob, SpecificMessage& msg) {
        msg.contents.assign(blob.begin(), blob.end());
        return true;
    }
};

template <typename Message> struct LazyLoader {
    using type = Message;

    Message operator()(blob_t const& blob) const {
        Message result;
        if (!load(blob, result)) throw std::bad_cast(); // TODO custom excepion
        return result;
    }
};

///////
// for demo, create some database contents
void create_database_file() {
    bip::file_mapping::remove(DBASE_FNAME);
    bip::managed_mapped_file mmf(bip::open_or_create, DBASE_FNAME, 1ul<<20); // Even sparse file size is limited on Coliru

    shared_blobs* table = mmf.find_or_construct<shared_blobs>("blob_table")(mmf.get_segment_manager());

    std::ifstream ifs("main.cpp");
    std::string line;
    while (std::getline(ifs, line)) {
        table->emplace_back();
        save(table->back(), SpecificMessage { line });
    }

    std::cout << "Created blob table consisting of " << table->size() << " blobs\n";
}

///////

void display_random_messages() {
    bip::managed_mapped_file mmf(bip::open_only, DBASE_FNAME);
    shared_blobs* table = mmf.find_or_construct<shared_blobs>("blob_table")(mmf.get_segment_manager());

    using It = boost::transform_iterator<LazyLoader<SpecificMessage>, shared_blobs::const_reverse_iterator>;

    // for fun, let's reverse the blobs
    for (It first(table->rbegin()), last(table->rend()); first < last; first+=13)
        std::cout << "blob: '" << first->contents << "'\n";

    // any kind of random access is okay, though:
    auto random = rand() % table->size();
    SpecificMessage msg;
    load(table->at(random), msg);
    std::cout << "Random blob #" << random << ": '" << msg.contents << "'\n";
}

int main()
{
#ifndef CONSUMER_ONLY
    create_database_file();
#endif

    srand(time(NULL));
    display_random_messages();
}
sehe
  • 374,641
  • 47
  • 450
  • 633
1

I have to admit I may not fully understand the trouble you have with holding the current MESSAGE as a member of Iter. I would associate each iterator with the FileReader it should read from and implement it as a lightweight encapsulation of a read index for FileReader::(read|moveTo). The most important method to overwtite is boost::iterator_facade<...>::advance(...) which modifies the current index and tries to pull a new MESSAGE from the FileReader If this fails it flags the the iterator as invalid and dereferencing will fail.

template<class MESSAGE,int STEP>           
class message_iterator; 

template<class MESSAGE> 
class FileReader { 
public: 
    typedef message_iterator<MESSAGE, 1> const_iterator; 
    typedef message_iterator<MESSAGE,-1> const_reverse_iterator; 

    FileReader(); 
    bool open(const std::string  & rName); 
    bool moveTo(int n); 
    bool read(MESSAGE &m); 

    // get the total count of messages in the file 
    // helps us to find end() and rbegin() 
    int getMessageCount(); 

    const_iterator begin() {                                           
        return const_iterator(this,0); 
    } 
    const_iterator end() { 
        return const_iterator(this,getMessageCount()); 
    } 
    const_reverse_iterator rbegin() { 
        return const_reverse_iterator(this,getMessageCount()-1); 
    } 
    const_reverse_iterator rend() { 
        return const_reverse_iterator(this,-1); 
    } 
}; 

// declaration of message_iterator moving over MESSAGE 
// STEP is used to specify STEP size and direction (e.g -1 == reverse) 
template<class MESSAGE,int STEP=1>                                                 
class message_iterator 
    : public boost::iterator_facade< 
    message_iterator<MESSAGE> 
    , const MESSAGE  
    , boost::random_access_traversal_tag 
    > 
{ 
    typedef  boost::iterator_facade< 
        message_iterator<MESSAGE> 
        , const MESSAGE 
        , boost::random_access_traversal_tag 
        > super; 

public:                                                               
    // constructor associates an iterator with its FileReader and a given position 
    explicit message_iterator(FileReader<MESSAGE> * p=NULL,int n=0): _filereader(p),_idx(n),_valid(false)    { 
        advance(0); 
    } 
    bool equal(const message_iterator & i) const { 
        return i._filereader == _filereader && i._idx == _idx; 
    } 
    void increment() { 
        advance(+1); 
    } 
    void decrement() { 
        advance(-1); 
    } 

    // overwrite with central functionality. Move to a given relative 
    // postion and check wether the position can be read. If move/read 
    // fails we flag the iterator as incalid. 

    void advance(int n) { 
        _idx += n*STEP; 
        if(_filereader!=NULL) { 
            if( _filereader->moveTo( _idx ) && _filereader->read(_m)) { 
                _valid = true; 
                return; 
            } 
        } 
        _valid = false; 
    } 
    // Return a ref to the currently cached MESSAGE. Throw 
    // an acception if positioning at this location in advance(...) failes. 
    typename super::reference dereference() const { 
        if(!_valid) { 
            throw std::runtime_error("access to invalid pos"); 
        } 
        return _m; 
    } 

private: 
    FileReader<MESSAGE> * _filereader; 
    int                   _idx; 
    bool                  _valid; 
    MESSAGE               _m; 
}; 
Oncaphillis
  • 1,888
  • 13
  • 15
  • Thanks for your input. That exactly mimics what we currently have, except on how you implement the reverse iterator. I'm using std::reverse_iterator adaptor. And all my issues stem from its implementation (which i suppose agrees with STL) and therefore my iterator is flawed. I understand that I could circumvent this with the solution you provide (STEP=-1), but I think that other STL compliant things will fail too. – chrish. Nov 09 '14 at 16:59
1

Boost PropertyMap

You could avoid writing the bulk of the code using Boost PropertyMap:

Live On Coliru

#include <boost/property_map/property_map.hpp>
#include <boost/property_map/function_property_map.hpp>

using namespace boost;

struct SpecificMessage {
    // add some data
    int index; // just for demo
};

template <typename Message>
struct MyLazyReader {
    typedef Message type;
    std::string fname;

    MyLazyReader(std::string fname) : fname(fname) {}

    Message operator()(size_t index) const { 
        Message m;
        // FileReader fr;
        // fr.open(fname);
        // fr.moveTo(index);     // Move to Message 
        // fr.read(&m);          // Try deserializing as SpecificMessage  
        m.index = index; // just for demo
        return m;
    }
};

#include <iostream>

int main() {

    auto lazy_access = make_function_property_map<size_t>(MyLazyReader<SpecificMessage>("file.bin"));

    for (int i=0; i<10; ++i)
        std::cout << lazy_access[rand()%256].index << "\n";
}

Sample output is

103
198
105
115
81
255
74
236
41
205

Using Memory Mapped Files

You could store a map of index -> BLOB objects in a shared vector<array<byte, N>>, flat_map<size_t, std::vector<uint8_t> > or similar.

So, now you only have to deserialize from myshared_map[index].data() (begin() and end() in case the BLOB size varies)

sehe
  • 374,641
  • 47
  • 450
  • 633
  • Hi, thanks for the feedback. Indeed function_property_map avoids most of the C++ bulk of the code. However, a secondary function in our project works with iterators over ranges. As I'd like to use that function would still like to have an iterator. As far as memory mapped files is concerned, this seems not what I want to have, as how would you provide an iterator over SpecificMessages for them? – chrish. Nov 10 '14 at 16:33
  • @ChristophHeindl I've taken the liberty of demonstrating this in another answer: **[using Memory Mapped File](http://stackoverflow.com/a/26855444/85371)**. – sehe Nov 11 '14 at 00:23