
I have to write a program that reads a huge file with operator>>, counts things, and then inserts everything into a multimap.

It works perfectly with small files, but with big files the program runs out of memory: it throws a bad_alloc on Windows 32-bit.

I made a 64-bit build on Windows (to avoid the memory allocation limit for 32-bit programs), opened the Task Manager, and saw that memory usage grows continually; when the RAM is full the program crashes. Do you know how to deal with this?

Here is the code:

#include <iostream>
#include <string>
#include <fstream>
#include <unordered_map>

using namespace std;

int main()
{
    ifstream fin("C:/text.txt");
    string const outfile("C:/out/out.txt");
    ofstream fout (outfile.c_str());
    string word;
    unordered_multimap<string, int> mymm;
    int intnum = 0;

    while(fin >> word)
    {
        mymm.insert(pair<string, int>(word, intnum));
        ++intnum;
    }

    return 0;
}
Learn15773

1 Answer


A modification of divide & conquer will do here:

  1. Dump the map to a file (a new file each time) after every N lines and clear it. As a side effect you learn the total number of lines in the input.
  2. Load one dumped file at a time and concatenate all values for keys starting with "a"-"d" into a map. Once all pass-1 files are processed, repeat for "e"-"k", and so on.
  3. You now have a persistent version of the required map, segmented across files ad, ek, ..., so if you know the key you can open the respective file and find the key-value pairs (each file is sorted, so the search is log N). Problem solved.

size_t line_number = 0;
size_t const fragment_size = 1ul << 19;
sorted_index_t segmented_partial_index; // key -> (pos1, pos2, ...)

while (fin >> word) {
    segmented_partial_index[word].append(++line_number); // syntax may vary depending on index_t type

    if (!(line_number % fragment_size)) {
        dump_partial_index("segmented/", segmented_partial_index);
        segmented_partial_index.clear();
    }

}

if (line_number % fragment_size) {
    dump_partial_index("segmented/", segmented_partial_index);
    segmented_partial_index.clear();
}

index_t alphabetical_partial_index;
for (auto range: {"ad", "ek", "lr", "sz"}) {
    for (auto name: partial_index_filenames) {
        // the below loop can be optimized from N to log(N) if needed
        load_partial_index("segmented/", name, segmented_partial_index);
        for (auto x: segmented_partial_index) {
            if (x.first[0] < range[0]) continue;
            if (x.first[0] > range[1]) break;
            alphabetical_partial_index[x.first] += x.second;
        }
    }
    dump_partial_index("alphabetical/", alphabetical_partial_index);
    segmented_partial_index.clear();
    alphabetical_partial_index.clear();
}

bobah