4

I've got the following code below which parses a text file and indexes the words and lines:

bool Database::addFromFileToListAndIndex(string path, BSTIndex* & index, list<Line *> & myList)
{
    bool result = false;
    ifstream txtFile;
    txtFile.open(path, ifstream::in);
    char line[200];
    Line * ln;
    //if path is valid AND is not already in the list then add it
    if(txtFile.is_open() && (find(textFilePaths.begin(), textFilePaths.end(), path) == textFilePaths.end())) //the path is valid
    {
        //Add the path to the list of file paths
        textFilePaths.push_back(path);
        int lineNumber = 1;
        while(!txtFile.eof())
        {
            txtFile.getline(line, 200);
            ln = new Line(line, path, lineNumber);
            if(ln->getLine() != "")
            {
                lineNumber++;
                myList.push_back(ln);
                vector<string> words = lineParser(ln);
                for(unsigned int i = 0; i < words.size(); i++)
                {
                    index->addWord(words[i], ln);
                }
            }
        }
        result = true;
    }
    return result;
}

My code works flawlessly and fairly quickly until I give it a HUGE text file. Then I get a stack overflow error from Visual Studio. When I switch to "Release" configuration the code runs without a hitch. Is there something wrong with my code or is there some kind of limitation when running the "Debug" configuration? Am I trying to do too much in one function? If so how can I break it up so it doesn't crash while debugging?

EDIT Per request, my implementation of addWord;

void BSTIndex::addWord(BSTIndexNode *& pCurrentRoot, string word, Line * pLine)
    {
        if(pCurrentRoot == NULL)  //BST is empty
        {
            BSTIndexNode * nodeToAdd = new BSTIndexNode();
            nodeToAdd->word = word;
            nodeToAdd->pData = pLine;
            pCurrentRoot = nodeToAdd;
            return;
        }
        //BST not empty
        if (word < (pCurrentRoot->word)) //Go left
        {
            addWord(pCurrentRoot->pLeft, word, pLine);
        }
        else //Go right
        {
            addWord(pCurrentRoot->pRight, word, pLine);
        }
    }

And lineParser:

vector<string> Database::lineParser(Line * ln) //Parses a line and returns a vector of the words it contains
{
    vector<string> result;
    string word;
    string line = ln->getLine();
    //Regular Expression, matches anything that is not a letter, number, whitespace, or apostrophe
    tr1::regex regEx("[^A-Za-z0-9\\s\\']");
    //Using regEx above, replaces all non matching characters with nothing, essentially removing them.
    line = tr1::regex_replace(line, regEx, std::string(""));

    istringstream iss(line);
    while(iss >> word)
    {
        word = getLowercaseWord(word);
        result.push_back(word);
    }
    return result;
}
Pete
  • 10,651
  • 9
  • 52
  • 74
  • 3
    It's unrelated to the question, but it appears there is a leak for empty lines. The code creates a `new Line(...)` object and appears to never delete it if the line is empty. – Mark Wilkins Apr 14 '11 at 23:48
  • @sarnold No, that's not the cause of the stack overflow. He was just pointing it out which is why he commented instead of answering. – Pete Apr 15 '11 at 00:03
  • @Pete: How HUGE is your HUGE text file? If you're just within the limits of available address space in Release, since Debug likely adds some stuff in memory (stack guards, ...) you might go above the limit and then crash. – Romain Apr 15 '11 at 00:12
  • @Romain I don't think it's that big to be honest. It's 530KB on disk, has about 100,000 words and around 10,000 lines give or take. If the problem is that I'm trying to process too much info how can I split up the work? – Pete Apr 15 '11 at 00:15
  • 1
    You may also be crushing your stack and only get it detected in debug thanks to the stack guards (not sure how VS handles that). Would be cool if you could check it out with stuff like valgrind. – Romain Apr 15 '11 at 00:19
  • can you post your implementation of addWord? and lineParser? – Vusak Apr 15 '11 at 00:22
  • Have you tried to change stack size and see what happening? By the way, allocation of arrays in stack can lead to such errors (char line[200]; in your code). But here the size of the array is small and if the function is called only once, it is very unlikely to cause overflow. – fdermishin Apr 15 '11 at 00:48

4 Answers4

5

A stack overflow indicates that you've run out of stack space (probably obvious, but just in case). Typical causes are non-terminating or excessive recursion, or very large stack object duplication. Funnily enough it might be either in this case.

It's likely that in Release your compiler is doing tail call optimization which inhibits stack overflow from excessive recursion.

It's also likely that in Release your compiler is optimizing the return copy of the vector from lineParser.

So you need to find out which condition is overflowing in Debug, I would start with the recursion as the most likely culprit, trying changing the string parameter type to a reference, ie.

void BSTIndex::addWord(BSTIndexNode *& pCurrentRoot, string & word, Line * pLine)

This should stop you from duplicating word object on each nested invocation of addWord.

Also consider adding a std::cout << "recursing addWord" << std::endl; type statement to addWord so that you can see how deep its going and if its terminating correctly.

Vusak
  • 1,420
  • 9
  • 12
  • Yep, definitely is a tail-call. – Ben Voigt Apr 15 '11 at 01:12
  • I think the recursive call is the problem for certain. I took your advice and put a static counter inside of addWord. It increments before every recursive call and then prints out and resets when the function is ended. The file is still parsing but so far I am up to 1500+ recursive calls when adding some words and that number will only get larger as the size of the BST increases. Should I not worry about the stack overflow when I'm in "Debug" config since it works OK in "Release" config? – Pete Apr 15 '11 at 05:21
  • 3,469 calls deep and I got the stack overflow. – Pete Apr 15 '11 at 05:28
  • @Pete I think you should try to avoid the stack overflow both in Debug and Release, simply because it will complicate debugging other issues. Chris Dodd provides a better alternative for implementation, unless you absolutely must use recursion. Meanwhile I would also recommend having another look at your data structure and what you want it to achieve - maybe a hashmap or Trie could work better? And it looks like you're duplicating word entries across multiple nodes (this could be a shared pointer to a word instead). – Vusak Apr 15 '11 at 07:09
  • 1
    Since I accepted this answer I wanted to say that I ended up actually changing the data structure and left the recursive call (which was causing the stack overflow) in tact. The root of my problem I think is that I was storing each instance of a word. The data structure now only stores unique words and each word has a pointer to a set of lines. This actually improves my code in addition to removing the excessive recursion. – Pete Apr 16 '11 at 03:35
3

The problem is almost certainly the recursive call in addWord -- in a non-optimized build this will consume lots of stack space, while in an optimized build, the compiler will turn it into a tail call, which reuses the same stack frame.

You could manually transform the recursive call into a loop pretty easily:

void BSTIndex::addWord(BSTIndexNode ** pCurrentRoot, string word, Line * pLine)
{
    while (*pCurrentRoot != NULL) {
        //BST not empty
        if (word < (*pCurrentRoot)->word) //Go left
        {
            pCurrentRoot = &(*pCurrentRoot)->pLeft;
        }
        else //Go right
        {
            pCurrentRoot = &(*pCurrentRoot)->pRight;

        }
    }
    //BST is empty
    BSTIndexNode * nodeToAdd = new BSTIndexNode();
    nodeToAdd->word = word;
    nodeToAdd->pData = pLine;
    *pCurrentRoot = nodeToAdd;
}
Chris Dodd
  • 119,907
  • 13
  • 134
  • 226
  • The advice is sound, the implementation is lacking. I know that the original implementation was lacking too, but it's not a reason, in a proposed solution, for not using proper smart pointers where necessary. Because the next question of the OP (or any that come after him) will be: *Eh! I implemented Chris' solution and I've got a memory leak, Plz Help!*; I understand the willingness for short example, but good coding practices should not be forgone in the name of *brevity* or *simplicity*, as it's a loss in the long run. – Matthieu M. Apr 15 '11 at 07:16
1

You should post your stack also, that'd actually show what caused the overflow. It looks fairly obvious that the recursion in addWord is significantly consuming stack memory.

If you just want to have it work, go into your compiler/linker settings and increase the size reserved for your stack. By default it's only 1MB, crank it up to 32MB or something and you'll be assured whatever extra counter's or probe's the debuging build has, you'll have enough stack to handle it.

RandomNickName42
  • 5,923
  • 1
  • 36
  • 35
  • Yeah, the last thing in the stack that's mine are all calls (lots of calls) to addWord. I was doing some testing and the amount of addWord calls in the stack increases as I add more words. How can I mitigate that while keeping the function recursive? – Pete Apr 15 '11 at 00:48
  • Tail recursion, see here -> http://stackoverflow.com/questions/2693683/tail-recursion-in-c – RandomNickName42 Apr 15 '11 at 01:09
0

You can increase the size of the stack to appropriate number of bytes.

#pragma comment(linker, "/STACK:1000000000")  
Temak
  • 2,929
  • 36
  • 49