0

I have working c++ deamon. The problem is that daemon crash one or two times per month. As you see from GDB output below its happens when daemon search for unsigned int sessionID at std::map <unsigned int const, SessionID*> container. I can not reproduce the problem and think that its maybe something wrong with data from users (maybe std::sting cookie_ssid have some unexpected data and after conversion with strtoulsomething going wrong. (know, that this is not the correct way to get unsigned int )

I have only .core file after daemon crash. and see that problem at if (!_M_impl._M_key_compare(_S_key(__x), __k)). Any ideas how to solve this problem? Many thanks.

GDB output:

#0  std::_Rb_tree<unsigned int, std::pair<unsigned int const, SessionID*>, std::_Select1st<std::pair<unsigned int const, SessionID*> >, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, SessionID*> > >::find (this=0x529df15c, __k=@0x7f5fab7c)
    at stl_tree.h:1376
##########
1376            if (!_M_impl._M_key_compare(_S_key(__x), __k))
##########
#0  std::_Rb_tree<unsigned int, std::pair<unsigned int const, SessionID*>, std::_Select1st<std::pair<unsigned int const, SessionID*> >, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, SessionID*> > >::find (
    this=0x529df15c, __k=@0x7f5fab7c) at stl_tree.h:1376
#1  0x0805e6be in TR::find_session (this=0x529df110,
    cookie_ssid=@0x47ef3614, ptr_to_ptr_session=0x7f5fac7c)
    at stl_map.h:542

Function TR::find_session released like this:

bool TR::find_session ( const std::string &cookie_ssid, SessionID **ptr_to_ptr_session )
{
    unsigned int uint_sessionid = std::strtoul ( cookie_ssid.c_str(),NULL,0);

MUTEX_map_sessionids.lock_reading();
    std::map<unsigned int, SessionID*>::iterator it_sessionids = map_sessionids.find( uint_sessionid );

    if ( it_sessionids != map_sessionids.end() )
    { // exists
        *ptr_to_ptr_session = it_sessionids->second;
        MUTEX_map_sessionids.unlock();
        return true;
    }

MUTEX_map_sessionids.unlock();  
return false;   
}

EDIT My cleanup function that working at separated thread (one time per minute or 5 minutes). As requested at comments. I am not sure with this function. Maybe its buggy....

void TR::cleanup_sessions () // not protected from multithread using! used only at one thread
{
std::list<SessionID*> list_to_clean; // tmplary store sessions to delete

MUTEX_map_sessionids.lock_reading();
std::map<unsigned int, SessionID*>::iterator it_sessionids = map_sessionids.begin();
MUTEX_map_sessionids.unlock();

while ( true )
{
    MUTEX_map_sessionids.lock_writing();
    if (it_sessionids == map_sessionids.end() )
    {
        MUTEX_map_sessionids.unlock();
        break;
    }

    SessionID *ptr_sessionid = it_sessionids->second;

    time_t secondsnow = time (NULL);

    ptr_sessionid->MUTEX_all_session.lock_reading();
    time_t lastaccesstime = ptr_sessionid->last_access_time;
    size_t total_showed = ptr_sessionid->map_showed.size(); 
    ptr_sessionid->MUTEX_all_session.unlock();


    if ( lastaccesstime and secondsnow - lastaccesstime > LOCALSESSION_LIFETIME_SEC ) // lifetime end!
    {
        // delete session from map
        map_sessionids.erase( it_sessionids++ ); // Increments the iterator but returns the original value for use by erase
        MUTEX_map_sessionids.unlock();              


        list_to_clean.push_back ( ptr_sessionid ); // at the end
    }
    else if ( total_showed == 0 and secondsnow - lastaccesstime > 36000 ) // not active for N secontes
    {
        map_sessionids.erase( it_sessionids++ ); // Increments the iterator but returns the original value for use by erase
        MUTEX_map_sessionids.unlock();

        // add pointer to list to delete it latter
        list_to_clean.push_back ( ptr_sessionid ); // at the end            
    }
    else
    {
        ++it_sessionids; // next
        MUTEX_map_sessionids.unlock();              
    }

}

// used? pause
if ( !list_to_clean.empty() ) 
{
    //sleep(1);
}

// cleanup session deleted from working map
while ( !list_to_clean.empty() )
{
    SessionID *ptr_sessionid_to_delete = list_to_clean.front();
    list_to_clean.pop_front();

    ptr_sessionid_to_delete->MUTEX_all_session.lock_writing(); // protected lock session mutex. can not delete session if its already locked. (additational protection)
    ptr_sessionid_to_delete->cleanup();
    delete ptr_sessionid_to_delete;
}

}

NOTE as you can see on every itteration I lock/unlock map_sessions, because other threads at this time find/insert new sessions and its critical, because users can not wait.

abrahab
  • 2,430
  • 9
  • 39
  • 64
  • 2
    The presence of a mutex implies the code is multithreaded, so it's likely a synchronization bug (are you modifying the map somewhere without holding the mutex?). Also you should really be using RAII to lock/unlock the mutex, while it might be OK here, C++ code that manually locks/unlocks a mutex is unlikely to be exception safe. – user786653 Dec 06 '12 at 18:06
  • @user786653 i think no, its not the mutex problem. I very carefully watch for mutexes. And gdb report that its crashed at std::find or std::compare.... :( – abrahab Dec 06 '12 at 18:13
  • 1
    Have you tried running your daemon under helgrind, it can help catch things you might missed? (`valgrind --tool=helgrind ./your-daemon`). Also try inspecting some of the values in gdb (does e.g. `_M_impl` point to something sane? Is the session id reasonable, is it present in the map? – user786653 Dec 06 '12 at 18:23
  • You could spend less time watching mutexes if you used a scoped lock instead of manually locking and unlocking them. – Captain Obvlious Dec 06 '12 at 18:33
  • 1
    @abrahab, Please, show us all the places in your program where you modify `map_sessionids` (insert/delete/update data in the map). –  Dec 06 '12 at 18:46
  • @user786653 No, I does not use valgrind/helgrind :( How to inspect `_M_impl` with `gdb`? SessionID maybe or may not be at the map, because this `std::string` from users and I can not control what users can send to the server(daemon)... – abrahab Dec 06 '12 at 18:48
  • @skwllsp I recheck all places. All `map_sessionids` protected with mutexes(read/write). I am sure with the code where I generate UnsignedID and `.insert` to the `map`. its fully protected with `write_lock()`, but I am not so sure with thread where I cleanup `map` from old sessions. give we 5 minutes, I will cleanup add it to my first post. thanks for your help. – abrahab Dec 06 '12 at 18:56
  • @skwllsp added code where I cleanup the map. as you can see, I lock/unlock mutex at every iteration, because at other treads users connecting (creating/finding sessions) and they can not wait. – abrahab Dec 06 '12 at 19:10

1 Answers1

2

Note that any modification of a map may invalidate any iterator in that map. You have:

MUTEX_map_sessionids.lock_reading();
std::map<unsigned int, SessionID*>::iterator it_sessionids = map_sessionids.begin();
MUTEX_map_sessionids.unlock();

Now immediately after this unlock, some other thread might acquire the lock and do some operation that invalidates it_sessionids, which will leave it in a state where the subsequent code will corrupt the map, leading to a later crash.

You need to acquire AND HOLD a lock over the entire lifetime of the iterator. It looks like you have reader/writer locks, so you need only hold a read lock for the whole time, upgrading it to a write lock when you want to modify the map and then immediately downgrading it back to a read lock after the modification. Holding a read lock for a long time will only block other threads that want to acquire a write lock, not other threads that only need a read lock.

Answering your questions in the comment:

  1. If you cannot hold the lock for an extended period of time, then you can't have iterators that remain valid for an extended period of time. One thing you can do is occasionally remember approximately where you are in the map, release the lock (giving other threads a chance and invalidating the iterator), then reaquire the lock and create a new iterator at approximately the same point. You could add something like this in the middle of your loop:

    if (++count > limit) { // only do this every Nth iteration
        unsigned int now_at = it_sessionids->first;
        MUTEX_map_sessionids.unlock();
        // give others a chance
        MUTEX_map_sessionids.lock_reading();
        it_sessionids = map_sessionids.lower_bound(now_at);
        count = 0; }
    
  2. Upgrading a lock from read only to read/write is a primitive operation on reader/writer locks that your implementation may not support. If not then you're out of luck and need to hold a writer lock the entire time.

Chris Dodd
  • 119,907
  • 13
  • 134
  • 226
  • two questions 1) what to do if I can not hold lock for significant time? other threads writing/reading from this map and there actions must me more priority. 2) how to "upgrade" or "downgrade" locks(mutexex)? if I will unlock_read() than immediately lock_writing() - potentially other thread can get lock mutex between these actions – abrahab Dec 06 '12 at 20:06
  • about point 1: maybe it will be a good idea to start from begin() of the map, then every 1000 iterations unlock the mutex, remember the `sessionID` where iteration end. After that LOCK repeatedly mutex and continue with .find(SessionID) to get place for other 1000 unprocessed iterations (then loop)? seems its give some time for other threads to process and answer to clients... – abrahab Dec 06 '12 at 20:17
  • 1
    @Chris Dodd `Note that any modification of a map may invalidate any iterator in that map`. Could you please comment on this? I have read http://stackoverflow.com/questions/6438086/iterator-invalidation-rules and it seems that map iterators are not invalidated on `any modification`. –  Dec 07 '12 at 04:58
  • @skwlisp: Yes, there are certain mofications you can make that will not invalidate certain iterators. In practice MOST modifications will not invalidate MOST iterators -- that's why this bug only hits once a month or so. But without carefully checking EVERY OTHER modification of the map EVERYWHERE in the program you can't be sure you wont ever hit one of the problem cases. – Chris Dodd Dec 07 '12 at 18:32