C++ memcpy copy of object appears corrupted

Question

As part of a class project to implement a compiler I am also implementing a hash table for use as the compiler's symbol table.

The implementation of the hash table is intended to be very low-level, a raw block of memory that is manually packed, and it is only intended to store Token objects. Thus, to optimize the hash table for serializability, I decided to simply inline the Tokens in the table, that is, simply memcpy the Token object into the table's memory when a Token is first inserted.

I am aware that one should not memcpy a class that has virtual functions or pointers, and that in general using memcpy on objects of a class is bad practice. However, the Token class does not have virtual functions or pointers as can be seen from its declaration below, and I would not use memcpy if this was not an exercise in low-level programming.

class Token {

    public:
        Token() : tag(BAD) {}
        Token(Tag t) : tag(t) {}
        Tag tag;
        size_t getSize(){ return sizeof(Token); }

};

The problem that I am having, is that the hash table inserts the Tokens correctly, and it finds the same memory location when looking for the same key, however, when I try to access the members of the returned Token pointer, it appears that the data has been corrupted.

I wrote a program to test the symbol table on a simple input. The program does the following:

Read input file into buffer.
Initialize Lexer, by inserting all built-in Tokens into Lexer symbol table.
Call Lexer's getToken method on input and print the Token's tag until the EOF Token is read.

While the symbol table returns the same location in memory that it inserted the Token into, the Token's tag attribute no longer matches the original one that was inserted. Below is the log output when the program inserts the keyword program into the symbol table:

[debug]   Entering SymbolTable.insert(const char* key)
[debug]   bin: 48 Searching for the last element in this bin's linked list.
[debug]   Last entry found... writing to entry's next location field.
[debug]   Writing the new entry's identifier to the table.
[debug]   The identifier: program has been written to the table.
[debug]   The memory blocks are not equal prior to assignment.
[debug]   The memory blocks are equal.
[debug]   nextLoc: 571 tag: 46
[debug]   Location of Token: 14287688

And below is the log output when the program subsequently looks up the identifier program in the symbol table.

[debug]   Entering Lexer.getToken()
[debug]   Entering SymbolTable.contains(const char* key)
[debug]   Entering SymbolTable.find(const char* key) key: program
[debug]   bin: 48
[debug]   Search location: 541
[debug]   Comparing key char: p to table char: p
[debug]   Comparing key char: r to table char: a
[debug]   Tag of entry: 1684368227
[debug]   Location of Token: 14287653
[debug]   Search location: 557
[debug]   Comparing key char: p to table char: p
[debug]   Comparing key char: r to table char: r
[debug]   Comparing key char: o to table char: o
[debug]   Comparing key char: g to table char: c
[debug]   Tag of entry: 1920296037
[debug]   Location of Token: 14287668
[debug]   Search location: 0
[debug]   Comparing key char: p to table char: p
[debug]   Comparing key char: r to table char: r
[debug]   Comparing key char: o to table char: o
[debug]   Comparing key char: g to table char: g
[debug]   Comparing key char: r to table char: r
[debug]   Comparing key char: a to table char: a
[debug]   Comparing key char: m to table char: m
[debug]   Tag of entry: 1207959598
[debug]   Location of Token: 14287688
The 1th token: 60

So as can be seen from the Location of Token message, the symbol table is finding the same location in memory where it wrote the Token to, but the tag of the Token is different. I'm stumped as to why this might be.

For completeness, here is the definition of the SymbolTable class.

template<class sizeType>    
class SymbolTable{      

    public:    
        SymbolTable();                                                             
        ~SymbolTable();        
        Token* operator[](const char* key);    
        bool contains(const char* key);    
        Token* insert(const char* key, Token value);    

    private:    
        void* find(const char* key);                                                        
        static const size_t nbins = 64;    
        sizeType nextLoc;                                                      
        void* table;       
        size_t hash(const char* key){    
            return (size_t)key[0] % nbins;    
        }            

};

And here is the source code for the symbol table's insert, find and operator[] functions.

template<class sizeType> Token* SymbolTable<sizeType>::insert(const char* key, Token value){

    BOOST_LOG_TRIVIAL(debug) << "Entering SymbolTable.insert(const char* key)";
    size_t bin = hash(key);
    void *sizeType_ptr,
         *tbl_char_ptr,
         *idSize_ptr;
    unsigned char idSize = 0;
    const char *key_ptr = key;
    Token *token_ptr = NULL;

    // Find the last entry in this bin's linked list.
    BOOST_LOG_TRIVIAL(debug) << "bin: " << bin
                             << " Searching for the last element in this bin's linked list.";
    sizeType_ptr = table + sizeof(sizeType)*bin;

    while(*((sizeType*)sizeType_ptr) != 0){
        sizeType_ptr = table + *((sizeType*)sizeType_ptr);
    }

    BOOST_LOG_TRIVIAL(debug) << "Last entry found... writing to entry's next location field.";
    // Write the location of the new entry to this entry's next field.
    *((sizeType*)sizeType_ptr) = nextLoc;

    // Move to new entry's location.
    sizeType_ptr = table + nextLoc;

    // Write identifier
    BOOST_LOG_TRIVIAL(debug) << "Writing the new entry's identifier to the table.";
    tbl_char_ptr = sizeType_ptr + sizeof(sizeType) + sizeof(unsigned char);
    while(*key_ptr != '\0'){
        *((char*)tbl_char_ptr) = *key_ptr;
        tbl_char_ptr = tbl_char_ptr + sizeof(char);
        ++key_ptr;
        ++idSize;
    }

    BOOST_LOG_TRIVIAL(debug) << "The identifier: " << key << " has been written to the table.";

    // Write length of identifier.
    idSize_ptr = sizeType_ptr + sizeof(sizeType);
    *((unsigned char*)idSize_ptr) = idSize;
    token_ptr = (Token*)(tbl_char_ptr + sizeof(char));

    void *tk = &value,
         *tb = token_ptr;
    for(int i = 0; i < value.getSize(); ++i){
        if(*((char*)tk) != *((char*)tb)){
            BOOST_LOG_TRIVIAL(debug) << "The memory blocks are not equal prior to assignment.";
            break;
        }
    }

    memcpy((void*)token_ptr, &value, value.getSize());

    bool areEqual = true;
    for(int i = 0; i < value.getSize(); ++i){
        if(*((char*)tk) != *((char*)tb)){
            areEqual = false;
            BOOST_LOG_TRIVIAL(debug) << "The memory blocks are not after assignment.";
            break;
        }
    }
    if(areEqual){
        BOOST_LOG_TRIVIAL(debug) << "The memory blocks are equal.";
    }

    nextLoc = nextLoc + sizeof(sizeType) + sizeof(unsigned char)
                + idSize*sizeof(char) + value.getSize();
    BOOST_LOG_TRIVIAL(debug) << "nextLoc: " << nextLoc
                             << " tag: " << token_ptr->tag;
    BOOST_LOG_TRIVIAL(debug) << "Location of Token: " << (size_t)token_ptr;
    return token_ptr;

}

template<class sizeType>
void* SymbolTable<sizeType>::find(const char* key){

    BOOST_LOG_TRIVIAL(debug) << "Entering SymbolTable.find(const char* key) "
                             << "key: " << key;
    bool found = false;
    size_t bin = hash(key);
    void *idSize,
         *sizeType_ptr,
         *tbl_char_ptr,
         *result_ptr = NULL;
    const char* key_ptr = key;

    BOOST_LOG_TRIVIAL(debug) << "bin: " << bin;
    sizeType_ptr = table + sizeof(sizeType)*bin;


    while(!found){

        found = true;
        // Is this the last element in this bin?
        if(*((sizeType*)sizeType_ptr) == 0){ 
            result_ptr = NULL;
            return result_ptr;
        }
        // Advance to the next element in this bin's linked list.
        sizeType_ptr = table + *((sizeType*)sizeType_ptr);
        idSize = sizeType_ptr + sizeof(sizeType);
        tbl_char_ptr = idSize + sizeof(unsigned char);

        BOOST_LOG_TRIVIAL(debug) << "Search location: " << *((sizeType*)sizeType_ptr);
        // Check if the passed key matches the current key in the table.
        for(int i = 0; i < *((unsigned char*)idSize); ++i){

            BOOST_LOG_TRIVIAL(debug) << "Comparing key char: " << *key_ptr
                                     << "to table char: " << *((const char*)tbl_char_ptr);
            // Check if the key is too short or if the chars do not match.
            if(*key_ptr != *((const char*)tbl_char_ptr)){
                found = false;
                break;
            }   

            ++key_ptr;
            tbl_char_ptr = tbl_char_ptr + sizeof(char);
            BOOST_LOG_TRIVIAL(debug) << "*(const char*)tbl_char_ptr: "
                                     << *((const char*)tbl_char_ptr);

        }

        result_ptr = tbl_char_ptr + sizeof(char);
        BOOST_LOG_TRIVIAL(debug) << "Tag of entry: " << ((Token*)result_ptr)->tag;
        BOOST_LOG_TRIVIAL(debug) << "Location of Token: " << (size_t)result_ptr;
        key_ptr = key;

    }   

    return result_ptr;

}

template<class sizeType>
Token* SymbolTable<sizeType>::operator[](const char* key){

    BOOST_LOG_TRIVIAL(debug) << "Entering SymbolTable.operator[](const char* key)";
    void* void_ptr = find(key);
    Token* token_ptr = (Token*)void_ptr;
    return token_ptr;

}

And here is the test program's source code.

int main(){

    cout << "Executing testLexer.cpp" << endl;

    ifstream file("./pascalPrograms/helloworld.pas");
    string program((istreambuf_iterator<char>(file)), istreambuf_iterator<char>());
    cout << "program:\n\n" << program << endl;
    int fileSize = program.length();
    const char* buffer = program.c_str();

    const char* scanp = buffer;

    cout << "Instantiating Lexer" << endl << endl;
    Lexer lexer = Lexer(scanp);

    Token* tok;
    int i = 0;

    cout << "Entering while loop to fetch tags." << endl << endl;
    do{ 
        i++;
        tok = lexer.getToken();
        cout << "The " << i << "th token: " << tok->tag << endl;
    } while(tok->tag != END_OF_FILE);

    return 0;

}

Thank you in advance for your help! :D

Edit:

Here is the input Pascal program:

program Hello;
begin
  writeln ('Hello, world.');
  readln
end.

And to clarify the question:

Why is the tag of the Token retrieved from the symbol table different from the tag of the Token put into the symbol table when the Token in the symbol table is an exact copy of the original?

@tobi303 The question is, when the program token is put in the symbol table, it has the correct tag of 46. But when it is retrieved later, it has a tag of 1207959598, even though the block of memory in the symbol table containing the token is the same as the original token itself. I would like to make the question more clear in the post, what would be a good way to phrase that? — Roland Maio, Oct 08 '15 at 19:27
adding an appropriate tag would help already to avoid getting critics from noobs like me ;) (maybe `compiler-construction` ?) — 463035818_is_not_an_ai, Oct 08 '15 at 19:31
You're adding to void pointers without size spec - I'm fairly sure that shouldn't compile. What compiler & what flags are you using? — dascandy, Oct 08 '15 at 19:33
@tobi303 I'll add the tag, I wasn't sure if I should do that or not before, thanks! — Roland Maio, Oct 08 '15 at 19:40
@dascandy I am using GCC 4.9, and here are the options: -pthread -lrt -Wall -g -std=c++11 — Roland Maio, Oct 08 '15 at 19:44

dascandy · Accepted Answer · 2015-10-08T19:54:25.183

1

Found it. You're overwriting the first byte of your tag with an 'H'. The other bytes are fine. Now to find where that H is coming from...

nextLoc = nextLoc + sizeof(sizeType) + sizeof(unsigned char)
            + idSize*sizeof(char) + value.getSize();

You need to add one more here. You have the skip (sizeType), the length byte (unsigned char), the id itself (idSize * sizeof(char)) and the value (value.getSize()), but you also leave a byte between id and value that you're not accounting for. That's why the last byte of your tag is getting overwritten - and because you're testing on a little-endian machine that results in the highest byte being corrupted.

    for(int i = 0; i < *((unsigned char*)idSize); ++i){
     ...
        tbl_char_ptr = tbl_char_ptr + sizeof(char);
    ...
    }

    result_ptr = tbl_char_ptr + sizeof(char);

That's one more than idSize.

edited Oct 08 '15 at 19:54

answered Oct 08 '15 at 19:27

dascandy

7,184
1
29
50

The SymbolTable is printing the location of every token it examines in its search, the location at 1428768 is the location of the preceding Token, and not the one that is returned which is 1428788. Inspecting the log output from the insertion, not the search, shows that this is the location of the program Token. – Roland Maio Oct 08 '15 at 19:36
Yeah, sorry... I thought that this was the thing you were asking. Your question is, as somebody else also remarked, a bit vague. I'm trying to comb out where the data may be overwritten now... do you happen to also have the input file? – dascandy Oct 08 '15 at 19:39
Yes! I will add the input file, and try to make the question clearer. – Roland Maio Oct 08 '15 at 19:42
1

Updated answer with the actual bug. The programming style makes it very hard to read what you're doing. Consider putting the variable-length thing at the end and using a struct to write to it. Also, this code is *very* prone to bus errors on many computers, because of unaligned access. – dascandy Oct 08 '15 at 19:55
How did you figure out that the first byte was being overwritten with an 'H'? – Roland Maio Oct 08 '15 at 19:56
Using a calculator. 1207959598 in hex is 0x4800002E. In memory that's 0x2E 0x00 0x00 0x48 -> that last byte. – dascandy Oct 08 '15 at 19:57

C++ memcpy copy of object appears corrupted

1 Answers1