
I've recently encountered a problem with C++ object creation. The problem is similar to the one in the question C++ strange segmentation fault by object creation; however, the code here is part of an open-source project and is unlikely to contain trivial errors.

The object is created inside a method, and that method is called twice in succession.

The class is defined in strtokenizer.h as follows:

#include <string>
#include <vector>

using namespace std;

class strtokenizer {
protected:
    vector<string> tokens;  // the tokens extracted by parse()
    int idx;                // current scan position for next_token()

public:
    strtokenizer(string str, string seperators = " ");
    void parse(string str, string seperators);
    int count_tokens();
    string next_token();
    void start_scan();
    string token(int i);
};

And in strtokenizer.cpp, it is like this:

#include "strtokenizer.h"

using namespace std;

strtokenizer::strtokenizer(string str, string seperators) {
    parse(str, seperators);
}

void strtokenizer::parse(string str, string seperators) {
    int n = str.length();
    int start, stop;
    start = str.find_first_not_of(seperators);
    while (start >= 0 && start < n) {
        stop = str.find_first_of(seperators, start);
        if (stop < 0 || stop > n) {
            stop = n;
        }
        tokens.push_back(str.substr(start, stop - start));
        start = str.find_first_not_of(seperators, stop + 1);
    }
    start_scan();
}
int strtokenizer::count_tokens() {
    return tokens.size();
}
void strtokenizer::start_scan() {
    idx = 0;
    return;
}
string strtokenizer::next_token() {
    if (idx >= 0 && idx < tokens.size()) {
        return tokens[idx++];
    } else {
        return "";
    }
}
string strtokenizer::token(int i) {
    if (i >= 0 && i < tokens.size()) {
        return tokens[i];
    } else {
        return "";
    }
}
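
For reference, a minimal usage sketch of the class (my own test snippet, not part of the project):

#include <cstdio>
#include "strtokenizer.h"

int main() {
    // Split a sample wordmap line on whitespace, as read_wordmap() does.
    strtokenizer tok("hello 42", " \t\r\n");
    printf("tokens: %d\n", tok.count_tokens());    // prints: tokens: 2
    printf("word:   %s\n", tok.token(0).c_str());  // prints: word:   hello
    printf("id:     %s\n", tok.token(1).c_str());  // prints: id:     42
    return 0;
}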

The method that creates the strtokenizer objects is as follows:

int dataset::read_wordmap(string wordmapfile, mapword2id * pword2id) {
    pword2id->clear();
    FILE * fin = fopen(wordmapfile.c_str(), "r");
    if (!fin) {
        printf("Cannot open file %s to read!\n", wordmapfile.c_str());
        return 1;
    }
    char buff[BUFF_SIZE_SHORT];
    string line;
    fgets(buff, BUFF_SIZE_SHORT - 1, fin);
    int nwords = atoi(buff);
    for (int i = 0; i < nwords; i++) {
        fgets(buff, BUFF_SIZE_SHORT - 1, fin);
        line = buff;
        strtokenizer strtok(line, " \t\r\n");
        if (strtok.count_tokens() != 2) {
            continue;
        }
        pword2id->insert(pair<string, int>(strtok.token(0), atoi(strtok.token(1).c_str())));
    }
    fclose(fin);
    return 0;
}

When the read_wordmap() method runs for the first time (the first read_wordmap() call), the strtok object is created about 87k times; in the second call, it is expected to be created more than 88k times. However, the second call raises an error (sometimes "segmentation fault", sometimes "memory corruption (fast)") after about 86k creations, at the line:

strtokenizer strtok(line, " \t\r\n");

When the object-creation block is revised as below, no error occurs.

strtokenizer *strtok = new strtokenizer(line, " \t\r\n");
printf("line: %s", line.c_str());
if (strtok->count_tokens() != 2) {
    continue;
}
pword2id->insert(pair<string, int>(strtok->token(0), atoi(strtok->token(1).c_str())));
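
As pointed out in the comments below, this workaround never deletes the tokenizer, so it leaks one object per line. A sketch of the same loop body with the allocation owned by a std::unique_ptr (C++11), so it is released even on the continue path:

#include <memory>  // std::unique_ptr

std::unique_ptr<strtokenizer> strtok(new strtokenizer(line, " \t\r\n"));
printf("line: %s", line.c_str());
if (strtok->count_tokens() != 2) {
    continue;  // the tokenizer is freed here automatically
}
pword2id->insert(pair<string, int>(strtok->token(0),
                                   atoi(strtok->token(1).c_str())));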
  • Welcome to Stack Overflow! It sounds like you may need to learn how to use a debugger to step through your code. With a good debugger, you can execute your program line by line and see where it is deviating from what you expect. This is an essential tool if you are going to do any programming. Further reading: [How to debug small programs](http://ericlippert.com/2014/03/05/how-to-debug-small-programs/). – Paul R May 10 '16 at 19:33
  • Thanks for your suggestion. I know how to use debuggers, and I also wrote down the line where the error occurred. Do I need to provide more debugging information? – YongYoung May 10 '16 at 19:40
  • When you get a crash, then ideally you should do a `bt` (back-trace) and look at the state of the relevant local variables etc. – Paul R May 10 '16 at 19:43
  • @PaulR _`bt`_ => back trace?? – πάντα ῥεῖ May 10 '16 at 19:44
  • @πάνταῥεῖ: yes, `bt` is the command for backtrace in gdb. – Paul R May 10 '16 at 19:45
  • @PaulR Well, I rarely use the GDB CLI directly :-P ... – πάντα ῥεῖ May 10 '16 at 19:47
  • @πάνταῥεῖ: nor me, or not so much these days anyway, as `lldb` is what all the cool kids use now. ;-) – Paul R May 10 '16 at 19:49
  • If I had to guess, I would say your problem is here: `pair(strtok->token(0), atoi(strtok->token(1).c_str())`. Might be wise to wrap that as `string(strtok->token(0))` – doog abides May 10 '16 at 19:55
  • I cleaned up and compiled your code, but it has no problem with my 9-line test input. I suspect your problem is rather in the `mapword2id` class. Also, the code with the `new` operator will cause a memory leak because there is a missing `delete` before `continue`. – Ivan Marinov May 10 '16 at 20:02
  • To be clear, all that `read_wordmap`, `strtokenizer`, and its `parse` method are supposed to do is read n lines from a file (n being read from the file too), extract a string (a single word, as ' ' is a delimiter) and an int from each line, and insert them as a pair into a `mapword2id` container (whatever it is)? In C++? – Bob__ May 10 '16 at 20:20
  • Also tested with 2 files, each 95k lines, no issues. If you post your input files on pastebin.com I can check. Anyway, I strongly suggest writing unit tests for your code :) – Ivan Marinov May 10 '16 at 20:46
  • Thanks for all your help! I debugged the code in VS and found that the lines where the errors occur are quite different from where they occur on Linux. I have written the findings down below to finish my question page. I hope you won't blame me for wasting your time! – YongYoung May 12 '16 at 17:23

2 Answers


It looks like you have memory corruption in your code. You should consider using a tool like valgrind (http://valgrind.org/) to check that the code does not write out of bounds.

Your revised code uses heap memory instead of stack memory, which may hide the problem (even if it still exists).

Reading your code, several checks are missing to ensure safe handling in case the provided wordmapfile contains unexpected data. For example, you do not check the result of fgets, so if the word count at the beginning of the file is bigger than the real number of words, you will have issues.
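
For instance, a minimal sketch of that guard, reusing the names from the question's loop, would bail out when fgets hits end of file instead of re-tokenizing stale buffer contents:

for (int i = 0; i < nwords; i++) {
    // fgets returns NULL at end of file or on a read error;
    // stop instead of parsing whatever is left in buff.
    if (!fgets(buff, BUFF_SIZE_SHORT - 1, fin)) {
        printf("Unexpected end of file after %d of %d words!\n", i, nwords);
        break;
    }
    line = buff;
    strtokenizer strtok(line, " \t\r\n");
    if (strtok.count_tokens() != 2) {
        continue;
    }
    pword2id->insert(pair<string, int>(strtok.token(0),
                                       atoi(strtok.token(1).c_str())));
}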

PierreL
  • If the number in the first line of the input file is bigger than the number of lines in the input file, the last line will be inserted into pword2id multiple times. – Ivan Marinov May 10 '16 at 20:37

Following the suggestions of @Paul R and other friends, I carefully debugged my code and found the cause: I had not been freeing memory. The code posted above is only a tiny part of my project, in which a Gibbs sampling algorithm is supposed to run for one thousand iterations.

In each iteration, the old matrices are supposed to be freed and new ones allocated with new. However, I forgot to free all of the matrices and lists, and that is why my program corrupts memory.
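
For illustration, the pattern that avoids this class of bug is to let a container own each matrix, so it is released automatically at the end of every iteration. A minimal sketch (the names run_gibbs and theta and the dimensions are hypothetical, not from my project):

#include <vector>

void run_gibbs(int iterations, int rows, int cols) {
    for (int iter = 0; iter < iterations; iter++) {
        // Allocated fresh each iteration and freed automatically when
        // theta goes out of scope -- nothing to delete by hand.
        std::vector<std::vector<double> > theta(
            rows, std::vector<double>(cols, 0.0));
        // ... sampling step reads and updates theta ...
    }
}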

The reason I posted the code above is that the program crashed every time it reached the line:

strtokenizer strtok(line, " \t\r\n");

The object "strtok" will be run for 1000 * lines in files(with 10000+ lines). So it made me think maybe there are too many objects created and take up all of the stack memory. Even though I found there are no need to manually free them.

When I debugged the program in Visual Studio, the memory-usage monitor showed dramatic growth in each iteration, and a "bad_alloc" error occurred every now and then. This made me realize that I had forgotten to free some large dynamically allocated matrices.

Thanks to you all! And I apologize for the inaccurately described question that took up your time!

YongYoung