
I am loading a 10GB file into memory and I find that even if I strip away any extra overhead and store the data in nothing but an array, it still takes up 53GB of RAM. This seems crazy to me since I am converting some of the text data to longs, which take up less room, and converting the rest to char *, which should take up about the same amount of room as in the text file. I have about 150M rows of data in the file I am trying to load. Is there any reason why this should take up so much RAM when I load it the way I do below?

There are three files here: a fileLoader class, its header file, and a main that simply runs them. To answer some questions: the OS is Ubuntu 12.04 64-bit. This is on a machine with 64GB of RAM and an SSD from which I provide 64GB of swap space. I am loading all of the data at once because of the need for speed; it is critical for the application. All sorting, indexing, and much of the data-intensive work runs on the GPU. The other reason is that loading all of the data at once made it much simpler for me to write the code: I don't have to worry about indexed files and mappings to locations in another file, for example.

Here is the header file:

#ifndef FILELOADER_H_
#define FILELOADER_H_
#include <iostream>
#include <fstream>

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <string>
#include <unistd.h>   // read()/close() used in countLines()

class fileLoader {
public:
    fileLoader();
    virtual ~fileLoader();
    void loadFile();
private:
    long long ** longs;
    char *** chars;
    long count;
    long countLines(std::string inFile);
};


#endif /* FILELOADER_H_ */

Here is the CPP file:

#include "fileLoader.h"



fileLoader::fileLoader() {
    // TODO Auto-generated constructor stub
    this->longs = NULL;
    this->chars = NULL;
    this->count = 0;
}

char ** split(char * line, const char * delim, int size){
    // Splits the line in place with strsep(); the returned pointers point into
    // the caller's buffer, so that buffer must stay alive (or be copied from).
    char ** val = new char * [size];

    int i = 0;
    char * curVal;
    while((curVal = strsep(&line, delim)) != NULL && i < size){
        val[i] = curVal;
        i++;
    }

    return val;
}

void fileLoader::loadFile(){
    const char * fileName = "/blazing/final/tasteslikevictory";

    std::string fileString(fileName);
    // -1 since there's a header row and we are skipping it
    this->count = countLines(fileString) - 1;

    this->longs = new long long*[this->count];
    this->chars = new char **[this->count];
    std::ifstream inFile;

    inFile.open(fileName);
    if(inFile.is_open()){
        std::string line;
        int i = 0;
        // skip the header row
        getline(inFile,line);
        while(getline(inFile,line)){
            // Per line: the longs row, the chars row, a temporary copy of the line,
            // the array returned by split(), and seven strdup() calls below - nine
            // of these allocations stay alive for the life of the object.
            this->longs[i] = new long long[6];
            this->chars[i] = new char *[7];
            char * copy = strdup(line.c_str());
            char ** splitValues = split(copy,"|",13);

            this->longs[i][0] = atoll(splitValues[4]);
            this->longs[i][1] = atoll(splitValues[5]);
            this->longs[i][2] = atoll(splitValues[6]);
            this->longs[i][3] = atoll(splitValues[7]);
            this->longs[i][4] = atoll(splitValues[11]);
            this->longs[i][5] = atoll(splitValues[12]);

            this->chars[i][0] = strdup(splitValues[0]);
            this->chars[i][1] = strdup(splitValues[1]);
            this->chars[i][2] = strdup(splitValues[2]);
            this->chars[i][3] = strdup(splitValues[3]);
            this->chars[i][4] = strdup(splitValues[8]);
            this->chars[i][5] = strdup(splitValues[9]);
            this->chars[i][6] = strdup(splitValues[10]);
            i++;
            delete[] splitValues;
            free(copy);
        }
    }
}

fileLoader::~fileLoader() {
    // TODO Auto-generated destructor stub
    if(this->longs != NULL){
        for(long i = 0; i < this->count; i++){
            delete[] this->longs[i];   // each row was allocated with new long long[6]
        }
        delete[] this->longs;
    }

    if(this->chars != NULL){
        for(long i = 0; i < this->count; i++){
            for(int j = 0; j < 7; j++){
                free(this->chars[i][j]);   // strdup() memory is released with free()
            }
            delete[] this->chars[i];       // the row itself came from new char *[7]
        }
        delete[] this->chars;
    }
}

long fileLoader::countLines(std::string inFile){
    const int BUFFER_SIZE = 16*1024;
    int fd = open(inFile.c_str(), O_RDONLY);
    if(fd == -1)
        return 0;

    /* Advise the kernel of our access pattern.  */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char buf[BUFFER_SIZE + 1];
    long lines = 0;

    while(ssize_t bytes_read = read(fd, buf, BUFFER_SIZE))
    {
        if(bytes_read == -1){
            close(fd);
            return 0;
        }

        for(char *p = buf; (p = (char*) memchr(p, '\n', (buf + bytes_read) - p)); ++p)
            ++lines;
    }

    close(fd);
    return lines;
}

Here is the file with my main function:

#include "fileLoader.h"

int main()
{
    fileLoader loader;
    loader.loadFile();
    return 0;
}

Here is an example of the data that I am loading:

13|0|1|1997|113|1|4|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
14|0|1|1997|113|1|5|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
15|0|1|1997|113|1|6|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
16|0|1|1997|113|1|7|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
17|0|1|1997|113|1|8|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
18|0|1|1997|113|1|9|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
19|0|1|1997|113|1|10|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
20|0|1|1997|113|1|11|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
21|0|1|1997|113|1|12|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
9|0|1|1997|113|1|13|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
27|0|1|1992|125|1|1|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
28|0|1|1992|125|1|2|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
29|0|1|1992|125|1|3|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
30|0|1|1992|125|1|4|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
31|0|1|1992|125|1|5|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
32|0|1|1992|125|1|6|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
33|0|1|1992|125|1|7|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
34|0|1|1992|125|1|8|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
35|0|1|1992|125|1|9|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
36|0|1|1992|125|1|10|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
37|0|1|1992|125|1|11|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
38|0|1|1992|125|1|12|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
39|0|1|1992|125|1|13|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
40|0|1|1992|125|1|14|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
41|0|1|1992|125|1|15|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
10|0|1|1996|126|1|1||||||
  • How do you measure your memory consumption? – quantdev Jun 10 '14 at 20:23
  • Using htop; I am not using any memory management tools yet. I am very new to C++ (I am a Java developer), so I just see how much RAM is consumed, but the spike is from 2GB to 56GB when I load the data. – flip Jun 10 '14 at 20:26
  • Why load 10GB into memory? I do not think that is necessary - there are much better ways of processing data. – Ed Heal Jun 10 '14 at 20:28
  • Do you have 53GB of RAM? –  Jun 10 '14 at 20:28
  • (it took me a second to get past `char ***` :). It may be that the objects pointed to by all those individual `char**`s (and `char*`s) are not being compacted together very well. – dlf Jun 10 '14 at 20:30
  • 64GB swap plus 64GB in RAM. Ed, I have loaded it into memory because I can index it very quickly this way, which allows me to do the things I need to do with the data: joining different data sets, then creating inductive decision trees based on the data and a target. I have had it work as long as I only stick to numbers, though I end up using about 64GB of RAM and another 20GB of swap to do this with my current dataset. I need to start analyzing the char * and that will be larger than I can handle on this rig. – flip Jun 10 '14 at 20:30
  • He doesn't need the RAM to do this. Provided it is just for the load, it can all get paged out in any case. It is only if he/she starts to process it that physical RAM becomes an issue. – DrC Jun 10 '14 at 20:31
  • What OS is it? Can you add it to your question? – quantdev Jun 10 '14 at 20:32
  • @EdHeal We're not sure. –  Jun 10 '14 at 20:34
  • The OS is Ubuntu 12.04 64-bit; I added it. – flip Jun 10 '14 at 20:37
  • Is it really necessary to load everything all at once? Is it possible to implement your program to read and store 1 line at a time? – user3437460 Jun 10 '14 at 20:38
  • The heap chains in this alone are insane. There are considerably more efficient ways of loading this data if the representation of the sample data is accurate. Particularly if significant portions of the data are converted numerics (and it appears they are). – WhozCraig Jun 10 '14 at 20:40
  • Even if this is the case, I am more interested in understanding whether this is how C++ would be expected to work. Is this amount of overhead the expected amount? I have never loaded this much data in Java, where I am much more proficient, so I want to know: is this expected? If so, then I will have to deal with it somehow, like you say. If I can improve this significantly by fixing some kind of memory optimization I would prefer to do that and be able to keep the speed. – flip Jun 10 '14 at 20:41
  • The ones I am not loading as numbers - that's because there can be non-numeric data in those columns. Again, more than solving this particular problem, I would like to know if I am doing something horribly wrong in the way I am loading this data. – flip Jun 10 '14 at 20:42
  • Let's humour you a little. Consider what you are indexing: it will be either natural language or numbers - very low on memory. Also, why the need to load the lot into memory in the first place? New file/record - add it to the list. Besides, why are you using a `char ***`? It does not seem to have any meaning with the others in the class (i.e. the size of those arrays). – Ed Heal Jun 10 '14 at 20:49
  • char *** because the first index is the row, the second is which value within that row, and the last is the char * that contains the text. – flip Jun 10 '14 at 20:51
  • @flip - But how long is that row? How long are the columns? Going to run into other records. Besides, with all that memory why not use vectors and strings? – Ed Heal Jun 10 '14 at 20:55
  • This looks more like C than C++, just sayin'... – Macke Jun 17 '14 at 20:12
  • There is no need to copy 10GB of data from the file to operating memory. Instead, MAP the file into the process's memory. The memory-mapping technique is described here: http://en.wikipedia.org/wiki/Memory-mapped_file Basically, you obtain a pointer to the beginning of your file. With this technique, you can process your 10GB file even on a machine with, say, only 2GB of RAM! The access will be slower, but it will work. You can use the above pointer to find out the beginning of each line, the beginning of each string, etc. Just use it wisely; ensure sequential access whenever possible. – Oleg Jun 18 '14 at 01:41
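
A minimal sketch of the memory-mapping approach Oleg describes, reusing the file path from the question (error handling is kept short, so treat it as an illustration rather than a drop-in replacement for the loader):

#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main()
{
    const char *fileName = "/blazing/final/tasteslikevictory";  // path from the question

    int fd = open(fileName, O_RDONLY);
    if (fd == -1) { perror("open"); return 1; }

    struct stat sb;
    if (fstat(fd, &sb) == -1) { perror("fstat"); close(fd); return 1; }

    // Map the whole file read-only; pages are faulted in on demand, so this
    // works even when the file is larger than physical RAM.
    char *data = static_cast<char*>(
        mmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    if (data == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    madvise(data, sb.st_size, MADV_SEQUENTIAL);   // hint: sequential scan

    // Count lines straight out of the mapping instead of read() into a buffer.
    long lines = 0;
    for (const char *p = data, *end = data + sb.st_size;
         (p = static_cast<const char*>(memchr(p, '\n', end - p))); ++p)
        ++lines;
    std::printf("%ld lines\n", lines);

    munmap(data, sb.st_size);
    close(fd);
    return 0;
}

The parser could then walk the mapping directly (finding '|' and '\n' positions) instead of copying every line into its own heap block.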

2 Answers


You are allocating nine chunks of memory for each line, so across roughly 150 million lines you are making about 1.35 billion separate allocations. Each allocation carries a certain bookkeeping overhead, usually at least twice the size of a pointer, possibly even more. On a 64-bit machine that is already 16 bytes per allocation, so you get 21.6 GB of overhead.

In addition to that, you get the overhead of heap fragmentation and alignment: even if you only ever store a short string in a block, the allocator has to align and pad each allocation so that the largest possible values could be stored in it without misalignment. The required alignment may depend on the vector unit of your CPU, which can demand quite significant alignments; 16-byte alignment is not uncommon.

Doing the calculation with 16 bytes of allocation overhead and 16 bytes of alignment padding per allocation, we get 43.2 GB before any of the original data. Add the data itself and the estimate lands very close to your measurement.
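
A back-of-the-envelope version of that arithmetic; the per-allocation figures below are assumptions (the real numbers depend on the allocator), but they show how the total balloons:

#include <cstdio>

int main()
{
    const double rows         = 150e6; // ~150M data lines (from the question)
    const double allocsPerRow = 9;     // longs row + chars row + 7 strdup'd strings
    const double bookkeeping  = 16;    // assumed heap header per allocation (~two pointers)
    const double alignment    = 16;    // assumed padding up to the alignment boundary

    const double allocations = rows * allocsPerRow;                     // 1.35 billion
    const double overheadGB  = allocations * (bookkeeping + alignment) / 1e9;

    std::printf("%.2f billion allocations, ~%.1f GB of overhead before any data\n",
                allocations / 1e9, overheadGB);   // prints ~1.35 and ~43.2
    return 0;
}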

cmaster - reinstate monica
  • Thank you! I think I have an understanding of why this is happening and what steps I can take to improve my memory consumption (store all the chars together, for example). So the biggest problem is, like someone else said, that it is bad to make lots of small allocations. – flip Jun 10 '14 at 20:48
  • +1 I counted 10, but I started to get dizzy so I may have double-counted something. – WhozCraig Jun 10 '14 at 20:52

Each of those objects and strings you create has individual memory management overhead. So when you load the string "0" from column 2, depending on your memory manager, it probably takes between two and four full words (could be more). Call it 16 to 32 bytes of storage to hold a one-byte string. Then you load the "1" from column 3, and so on.
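
One way to see this overhead concretely on glibc is malloc_usable_size(); this is glibc-specific, so treat the snippet as an illustration rather than portable code:

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <malloc.h>   // malloc_usable_size() - glibc-specific

int main()
{
    // strdup("0") only needs 2 bytes (the character plus its terminator)...
    char *tiny = strdup("0");

    // ...but the allocator rounds the block up and keeps extra bookkeeping in
    // front of it, so the real cost of each tiny string is several times larger.
    std::printf("requested 2 bytes, usable block is %zu bytes\n",
                malloc_usable_size(tiny));   // typically 24 on 64-bit glibc

    free(tiny);
    return 0;
}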

DrC
  • That seems like a lot of overhead; let me get it straight. So each char * and each row of long values I store has an 8-byte pointer I have to store in addition to the data? That would seem to me that I only need 56 bytes for the chars and 8 bytes per long, so that would be 9.6GB, wouldn't it? – flip Jun 10 '14 at 20:37
  • It is. Allocating very many very small objects is extremely wasteful. Take a look at the first answer in http://stackoverflow.com/questions/13064850/malloc-vs-custom-allocator-malloc-has-a-lot-of-overhead-why. – DrC Jun 10 '14 at 20:41
  • Consider defining a struct that has one field for each column in the file and storing the data as a vector of those structs, using `char[n]` rather than a dynamically-allocated `char*` for string columns whose values will always be the same length (see the sketch after this comment thread). – dlf Jun 10 '14 at 20:43
  • Can you explain quickly why defining a struct to encapsulate this data could be more efficient than these admittedly very silly enormous arrays? I made the arrays because I actually thought they would consume less memory than a class I had implemented to contain it. – flip Jun 10 '14 at 20:46
  • Defining the struct will only help if you can define the storage for the strings inside it (so each string must have a known maximum length). If you do that, you only allocate one large "object" for each line, and so the overhead of a couple of words (or possibly just rounding up to a 16-byte boundary) becomes relatively smaller. – DrC Jun 10 '14 at 20:48
  • @flip It saves you from needing so many extra pointers. If all your strings have a known, fixed length (don't know if that's the case), the only pointer you would need is the one inside the vector. – dlf Jun 10 '14 at 20:48
  • OK, that makes complete and total sense. I am very new to C++ and am still understanding these concepts, but I get it now. The problem is all the pointers I need to manage my data. And when I say need I actually mean stupidly thought I did :) – flip Jun 10 '14 at 20:51
  • @flip If the *number* of strings is fixed (and it appears they are), you could potentially save yourself a considerable number of allocations by having a *single* allocation hold the data for all string values, separated by delimiters that are overwritten with terminators. I.e. the first token is the actual base pointer of *the* allocation, the rest are just pointers into *that* allocated buffer. Doing this would remove six allocations per iteration, while allowing you to keep your existing model. This is C++ so I would do most of this differently, but this specific item is a near-freebie. – WhozCraig Jun 10 '14 at 20:59
  • @flip On second thought, it would probably be more efficient (if a bit less natural) to store the data columnwise rather than rowwise. i.e. `long* col4Vals`, etc. You still have substantial overhead if you're forced to resort to `char**` for the string columns, but again that could be overcome if the strings in a given column all have the same length (or you waste less than the allocation overhead on average by imposing one). – dlf Jun 10 '14 at 20:59
  • So I could probably just pad the chars to make them fixed length; that would be less overhead than the pointers in most cases :) – flip Jun 10 '14 at 21:01
  • @flip You don't necessarily need to pad--if all the strings in some column are either 14 or 15 chars (including the null terminator), for example, you could just use char[15] and have one wasted byte when there are only 14. – dlf Jun 10 '14 at 21:02
  • Yes. Don't forget the 0 that terminates the string when you do the sizing – DrC Jun 10 '14 at 21:02
  • @dlf you mean putting \0 in the empty bytes, right? That's what I meant by padding. – flip Jun 10 '14 at 21:04
  • @flip It's fine to do that, but you don't need to. Anything that follows the first \0 is effectively invisible, so you might as well leave whatever garbage happens to be there already. – dlf Jun 10 '14 at 21:05
  • @dlf thanks! that will save me some processing time! – flip Jun 10 '14 at 21:11
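
Picking up dlf's struct suggestion and DrC's sizing advice from the comments above, a fixed-layout record might look like the sketch below. The field names and lengths are guesses taken from the sample rows (the hash-like columns are 32 characters, "SEMANAL" fits in a short buffer), so treat them as assumptions rather than a finished design:

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>

// One record per line: fixed-size char arrays replace the seven strdup'd pointers,
// so a line needs no heap allocations of its own once it sits inside the vector.
struct Record {
    long long nums[6];  // the six columns parsed with atoll()
    char col0[12];      // hypothetical widths chosen from the sample data
    char col1[4];
    char col2[4];
    char col3[8];
    char hash1[33];     // 32-character hex string + '\0'
    char label[16];     // e.g. "SEMANAL"
    char hash2[33];
};

int main()
{
    std::vector<Record> rows;
    // In the real loader you would rows.reserve(countLines(file) - 1) so the whole
    // table becomes one large allocation instead of ~1.35 billion tiny ones.

    Record r{};   // zero-initialised, so every string starts out terminated
    r.nums[0] = atoll("113");
    strncpy(r.hash1, "C9FF921CA04ADA3D606BF6DAC4A0B092", sizeof(r.hash1) - 1);
    strncpy(r.label, "SEMANAL", sizeof(r.label) - 1);
    rows.push_back(r);

    std::printf("one row costs sizeof(Record) = %zu bytes, with no per-field pointers\n",
                sizeof(Record));
    return 0;
}

Storing the data columnwise, as dlf also suggests, works just as well; the key point either way is that the number of heap allocations no longer grows with the number of lines.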