
I have a 36 MB data file (each value in the file is of type double) residing on my hard disk. My question is: when I read this file into RAM via C++ and put the contents in a matrix (provided by the Boost library), is it going to occupy only about 36 MB of RAM, or something different? Am I running out of memory?

The reason I ask is that I am on a 64-bit Ubuntu platform with 8 GB of RAM and I am getting a bad-allocation error. The same file-reading program works fine for small data files.

Below is the snippet that loads the [real-sim data](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html). `x` and `y` are a Boost matrix and vector, respectively, declared as `extern` in a .h file.
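For reference, each line of the file is in the LIBSVM sparse format: a class label followed by space-separated 1-based `index:value` pairs, e.g. (values here are invented for illustration):

-1 6:0.0342 12:0.1204 9083:0.0017
+1 27:0.0500 348:0.2500 15001:0.0031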

void load_data(const char* filename)
{
    ifstream in(filename);
    string line;
    int line_num = 0;
    if (in.is_open()) {
        while (in.good()) {
            getline(in, line);
            if (line.empty()) continue;
            int cat = 0;
            if (!parse_line(line, cat, line_num)) {
                cout << "parse line: " << line << ", failed.." << endl;
                continue;
            }

            y(line_num) = cat;

            line_num += 1;
        }
        in.close();
    }
}
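Side note, echoing a comment below: the `while (in.good())` loop can execute one extra iteration after the last successful `getline`. A minimal sketch of the conventional pattern, testing the stream state after each read and keeping the parsing body exactly as above:

while (getline(in, line)) {
    if (line.empty()) continue;
    // ... parse_line(...), y(line_num) = cat, line_num += 1, as above ...
}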

bool debug = false;
using namespace boost::numeric::ublas;
vector<double> y(no_records);               // ublas vector of class labels
matrix<double> x(no_records, no_features);  // ublas dense matrix of features
using namespace std;

template <class T>
void convert_from_string(T& value, const string& s)
{
    stringstream ss(s);
    ss >> value;
}

int get_cat(const string& data) {
    int c;
    convert_from_string(c, data);
    return c;
}


bool get_features(const string& data, int& index, double& value) {
    size_t pos = data.find(':');
    if (pos == string::npos) return false; // find returns size_t, not int
    convert_from_string(index, data.substr(0, pos));
    convert_from_string(value, data.substr(pos + 1));
    return true;
}


bool parse_line(const string& line, int& cat, const int line_num) {
    if (line.empty()) return false;
    size_t start_pos = 0;
    char space = ' ';

    while (true) {
        size_t pos = line.find(space, start_pos);

        if (pos != string::npos) {
            string data = line.substr(start_pos, pos - start_pos);
            if (!data.empty()) {
                if (start_pos == 0) {
                    cat = get_cat(data); // first token is the class label
                }
                else {
                    int index = -1;
                    double v = 0;
                    get_features(data, index, v);
                    if (debug)
                        cout << "index: " << index << "," << "value: " << v << endl;
                    if (index != -1) {
                        index -= 1; // file indices are 1-based
                        x(line_num, index) = v;
                    }
                }
            }
            start_pos = pos + 1;
        }
        else {
            // last token on the line (no trailing space)
            string data = line.substr(start_pos);
            if (!data.empty()) {
                cout << "read data: " << data << endl;
                int index = -1;
                double v = 0;
                get_features(data, index, v);
                if (debug)
                    cout << "index: " << index << "," << "value: " << v << endl;
                if (index != -1) {
                    index -= 1; // file indices are 1-based
                    x(line_num, index) = v;
                }
            }
            break;
        }
    }

    return true;
}
  • It would occupy around the same amount, yes. The crash/exception you get *might* be related to the size, but it doesn't have to be; it's impossible to say anything without more information, preferably some code. Please try to create a [Minimal, Complete, and Verifiable Example](http://stackoverflow.com/help/mcve) and show us, together with a small sample of the input file (if it's text). – Some programmer dude Oct 11 '15 at 16:53
  • You have made an assumption that is almost sure to be invalid. 36 MB is nothing, and the file size is not a direct contributor to your bug(s). – Martin James Oct 11 '15 at 16:53
  • I just ran my program on a file (42 MB on the HDD) that contains 1000 rows and 5K columns (features). It ran perfectly. But when I tried running the same code on a data file containing 70K rows and 20K columns, the bad-allocation error occurred (also on a data file of size 10K × 1M). This indicates that file size is not the issue. But can you guess why that error is occurring? – CKM Oct 11 '15 at 17:05
  • Guessing is not a useful tool in problem solving. Again, what is your program doing? Where does it allocate memory, and how? etc. – underscore_d Oct 11 '15 at 17:15
  • Perhaps when you load the data into a matrix, you are consuming more space than you need to for each feature. Without example code, we can only speculate. – Kevin Oct 11 '15 at 17:16
  • @underscore_d If you read the question carefully, you'll see I said above that my program reads a data file into a Boost matrix. FYI, I have now added the code. – CKM Oct 11 '15 at 17:29
  • Don't blame my reading comprehension when I had nothing useful to read. Congratulations on now achieving the latter. – underscore_d Oct 11 '15 at 17:42
  • @underscore_d I am not blaming you. As my initial question said, I got a bad-allocation error, and I thought it might be the huge amount of RAM taken up after reading the file. To answer the questions raised in your first comment, the useful things in my initial question were: 1. my program reads a file from the HDD into RAM using C++; 2. I used a Boost matrix to allocate the memory; 3. if you are familiar with how Boost allocates memory for matrices, that covers the "how" part of your question. Also, as I said above, my program works fine on small data files, so I thought it was not a programming issue. – CKM Oct 12 '15 at 03:55
  • Unrelated to your problem, but you might want to read ["Why is “while ( !feof (file) )” always wrong?"](http://stackoverflow.com/questions/5431941/why-is-while-feof-file-always-wrong). While the question is specific to C, the accepted answer is generic for just about all languages, especially C and C++. And it's not only `while (!in.eof())` that's bad; the equivalent `while (in.good())` suffers from the same problem. – Some programmer dude Oct 12 '15 at 06:47
  • More related to your problem: when you define the global `x` and `y` variables, what are the values of `no_records` and `no_features`? If they are too low you will probably be writing out of bounds, and if they're too high you're wasting a lot of memory. – Some programmer dude Oct 12 '15 at 06:53

1 Answer


I found the culprit. The reason for the bad-allocation error was that I was indeed running out of memory. The thing is that I was using a dense matrix representation (as provided by the Boost library). Storing even a 20000 × 40000 matrix densely requires 20000 × 40000 × 8 bytes ≈ 6.4 GB of RAM, and if you don't have that much free memory, a bad-allocation error is going to pop up.
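Since LIBSVM-format data is sparse (most entries are zero), the natural fix is a sparse matrix type, which stores only the non-zero entries. Below is a minimal sketch, assuming the same 1-based `index:value` file layout as in the question; the dimensions are placeholders (real-sim is roughly 72,309 records × 20,958 features), and the parsing is deliberately simplified:

#include <boost/numeric/ublas/matrix_sparse.hpp>
#include <boost/numeric/ublas/vector.hpp>
#include <fstream>
#include <sstream>
#include <string>

using namespace boost::numeric::ublas;

const int no_records  = 72309;   // placeholder dimensions for real-sim
const int no_features = 20958;

compressed_matrix<double> x(no_records, no_features); // stores non-zeros only
vector<double> y(no_records);

void load_sparse(const char* filename)
{
    std::ifstream in(filename);
    std::string line;
    int row = 0;
    while (std::getline(in, line)) {
        if (line.empty()) continue;
        std::istringstream ss(line);
        int cat;
        ss >> cat;                      // leading class label
        y(row) = cat;
        std::string tok;
        while (ss >> tok) {             // remaining "index:value" tokens
            std::size_t colon = tok.find(':');
            if (colon == std::string::npos) continue;
            int index    = std::stoi(tok.substr(0, colon));
            double value = std::stod(tok.substr(colon + 1));
            x(row, index - 1) = value;  // file indices are 1-based
        }
        ++row;
    }
}

For comparison, a dense 72,309 × 20,958 matrix of doubles would need about 72,309 × 20,958 × 8 bytes ≈ 12 GB, which explains the bad_alloc on an 8 GB machine. Note that element-wise assignment via `operator()` on a `compressed_matrix` works but can be slow for out-of-order insertion; `coordinate_matrix` or `mapped_matrix` may load faster.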
