13

I have an array of precomputed integers; it's a fixed size of 15M values. I need to load these values at program start. Currently it takes up to 2 minutes to load, and the file size is ~130MB. Is there any way to speed up loading? I'm free to change the save process as well.

std::array<int, 15000000> keys;

std::string config = "config.dat";

// how array is saved
std::ofstream out(config.c_str());
std::copy(keys.cbegin(), keys.cend(),
  std::ostream_iterator<int>(out, "\n"));

// load of array
std::ifstream in(config.c_str());
std::copy(std::istream_iterator<int>(in),
  std::istream_iterator<int>(), keys.begin());
in.close();

Thanks in advance.

SOLVED. Used the approach proposed in accepted answer. Now it takes just a blink.

Thanks all for your insights.

drumsta
  • 1,336
  • 2
  • 16
  • 27
  • What is `std::array` and why are you using it instead of `std::vector`? – Mike DeSimone Aug 05 '10 at 13:40
  • 5
    I suppose the OP knows the exact size of the file, so there's no need for `std::vector`. – Kirill V. Lyadvinsky Aug 05 '10 at 13:43
  • 1
    std::array is from the TR1 extension, but the effect is the same with std::vector. I'm using std::array because it's fixed size and I know the size at compile time. I'm free to use TR1 extension things. – drumsta Aug 05 '10 at 13:46
  • Then why use `std::array keys;` instead of `int keys[15000000];`? – Mike DeSimone Aug 05 '10 at 13:47
  • 3
    I'm trying to use pure c++ approach as much as possible. – drumsta Aug 05 '10 at 13:52
  • 1
    @Mike: `std::array` is in TR1 and C++0x. If the compiler supports it, it is perfectly fine to use. (And it provides better type safety than a raw array. It's not more expensive performance-wise.) – jalf Aug 05 '10 at 13:56
  • 1
    Although you should be careful about creating large automatic arrays (of either kind) - that could cause a stack overflow. Vectors will be allocated on the heap. – Mike Seymour Aug 05 '10 at 14:10
  • @Mike, thanks for the tip. Could you give some links to read about that, i.e. when a stack overflow is most likely to occur if I use large arrays? – drumsta Aug 05 '10 at 14:21
  • @kriau: this question might be useful: http://stackoverflow.com/questions/1915900/when-do-you-worry-about-stack-size – Mike Seymour Aug 05 '10 at 14:39
  • @kriau: in a single-threaded program, you'll probably be alright, depending on the platform. The stack usually grows from one end of the address space, and the heap from the other, and there's no overflow until you've used up the entire address space. With multiple threads, each thread needs a block of address space for its stack, and this usually can't grow - moving a stack would invalidate any pointers to objects on it. Depending on the platform, you can usually specify the stack size when you create a thread, but the default size must be fairly small to allow a large number of threads. – Mike Seymour Aug 05 '10 at 14:44
  • 1
    There's no way that loading 15M integers from a text file takes 2 minutes. I know that C++ streams have their overhead and can be slow, but they can't be that broken. Is that a network drive you are working on? Or does your hard disk make those silly page-thrashing noises when you load the data? – Nordic Mainframe Aug 05 '10 at 15:59
  • @kriau could you describe the hardware you are using to run this test? I get ~15 seconds for entire read + write using your older code with a ~240MB file on an Intel x64/linux. – carlsborg Aug 05 '10 at 19:52
  • 11
    I think you'd be interested in the stats found at the end of this article: http://www.codeproject.com/KB/recipes/Tokenizer.aspx –  Aug 05 '10 at 23:45

8 Answers

13

You have two issues regarding the speed of your write and read operations.

First, std::copy cannot do a block-copy optimization when writing to an output_iterator because it doesn't have direct access to the underlying target.

Second, you're writing the integers out as ASCII, not binary, so on each write iteration the output_iterator creates an ASCII representation of your int, and on read the text has to be parsed back into integers. I believe this is the brunt of your performance issue.

The raw storage of your array (assuming a 4-byte int) should only be 60MB, but since each ASCII character of an integer is 1 byte, any int with more than 4 characters is going to take more space than its binary storage, hence your 130MB file.

There is no easy way to solve your speed problem portably (so that the file can be read on machines with a different endianness or int size) or while using std::copy. The easiest way is to just dump the whole array to disk and then read it all back using fstream::write and fstream::read; just remember that it's not strictly portable.

To write:

std::fstream out(config.c_str(), std::ios::out | std::ios::binary);
out.write( reinterpret_cast<const char*>(keys.data()), keys.size() * sizeof(int) );

And to read:

std::fstream in(config.c_str(), std::ios::in | std::ios::binary);
in.read( reinterpret_cast<char*>(keys.data()), keys.size() * sizeof(int) );

----Update----

If you are really concerned about portability, you could use a portable format (like your initial ASCII version) in your distribution artifacts; then, when the program is first run, it could convert that portable format to a locally optimized version for use during subsequent executions.

Something like this perhaps:

std::array<int, 15000000> keys;

// data.txt holds the ASCII values and data.bin is the binary version
if(!file_exists("data.bin")) {
    std::ifstream in("data.txt");
    std::copy(std::istream_iterator<int>(in),
         std::istream_iterator<int>(), keys.begin());
    in.close();

    std::fstream out("data.bin", std::ios::out | std::ios::binary);
    out.write( reinterpret_cast<const char*>(keys.data()), keys.size() * sizeof(int) );
} else {
    std::fstream in("data.bin", std::ios::in | std::ios::binary);
    in.read( reinterpret_cast<char*>(keys.data()), keys.size() * sizeof(int) );
}

If you have an install process, this preprocessing could also be done at that time...

joshperry
  • 41,167
  • 16
  • 88
  • 103
  • 1
    +1 for noticing why the file size was so big. For the OP: the portability problems can be addressed if the written file needs to be read on another computer, but you may have to convert the data on the way in or on the way out. – Dolphin Aug 05 '10 at 14:24
  • To me this is a dead end. It was OK 20 years ago when we wanted to save space on files, but the inherent portability issues make this method unmaintainable. With processors advancing quickly, it will not be long before this becomes your bottleneck, as converting the data from an obsolete binary format for current processors (soon 64-bit will be standard) will make the code much more complex (and slower again (not as slow as streamed text, but slower)). – Martin York Aug 05 '10 at 15:39
  • 2
    Martin, don't get me wrong, I agree completely though I feel that I answered his question with the appropriate caveats. Software development is always about compromise, and compromises should always be approached with the appropriate information and precautions. – joshperry Aug 05 '10 at 16:44
  • There's another function that precomputes all the integers. It takes 10 minutes to precompute; that's why I store the integers in a file. In the case of another platform, I always have the option to recompute the integers. Thanks for your insight! – drumsta Aug 05 '10 at 16:50
7

Attention. Reality check ahead:

Reading integers from a large text file is an I/O-bound operation unless you're doing something completely wrong (like using C++ streams for this). Loading 15M integers from a text file takes less than 2 seconds on an AMD64 @ 3GHz when the file is already buffered (and only a bit longer if it has to be fetched from a sufficiently fast disk). Here's a quick & dirty routine to prove my point (which is why I don't check for all possible errors in the format of the integers, nor close my files at the end, because I exit() anyway).

$ wc nums.txt
 15000000  15000000 156979060 nums.txt

$ head -n 5 nums.txt
730547560
-226810937
607950954
640895092
884005970

$ g++ -O2 read.cc
$ time ./a.out <nums.txt
=>1752547657

real    0m1.781s
user    0m1.651s
sys     0m0.114s

$ cat read.cc 
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <vector>

int main()
{
        int c;  // int, not char, so EOF can be detected reliably
        int num=0;
        int pos=1;
        int line=1;
        std::vector<int> res;
        while(c=getchar(),c!=EOF)
        {
                if (c>='0' && c<='9')
                        num=num*10+c-'0';
                else if (c=='-')
                        pos=0;
                else if (c=='\n')
                {
                        res.push_back(pos?num:-num);
                        num=0;
                        pos=1;
                        line++;
                }
                else
                {
                        printf("I've got a problem with this file at line %d\n",line);
                        exit(1);
                }
        }
        // make sure the optimizer does not throw the vector away; also a check.
        unsigned sum=0;
        for (size_t i=0;i<res.size();i++)
        {
                sum=sum+(unsigned)res[i];
        }
        printf("=>%u\n",sum);
}

UPDATE: and here's my result when reading the text file (not binary) using mmap:

$ g++ -O2 mread.cc
$ time ./a.out nums.txt
=>1752547657

real    0m0.559s
user    0m0.478s
sys     0m0.081s

code's on pastebin:
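
(The pastebin code isn't reproduced here. As a rough illustration only, and not the original pastebin code, an mmap-based variant of the parser above could look something like this, assuming POSIX mmap, the file name passed as argv[1], and a file that ends with a newline:)

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <vector>

int main(int argc, char **argv)
{
        if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        fstat(fd, &st);

        // map the whole file read-only; no read() copies into a user buffer
        void *m = mmap(0, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (m == MAP_FAILED) { perror("mmap"); return 1; }
        const char *p = (const char*)m;

        std::vector<int> res;
        res.reserve(15000000);
        int num = 0, pos = 1;
        for (off_t i = 0; i < st.st_size; ++i)
        {
                char c = p[i];
                if (c >= '0' && c <= '9')
                        num = num*10 + (c - '0');
                else if (c == '-')
                        pos = 0;
                else if (c == '\n')
                {
                        res.push_back(pos ? num : -num);
                        num = 0;
                        pos = 1;
                }
        }

        unsigned sum = 0;
        for (size_t i = 0; i < res.size(); i++)
                sum += (unsigned)res[i];
        printf("=>%u\n", sum);

        munmap(m, st.st_size);
        close(fd);
}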

What do I suggest

1-2 seconds is a realistic lower bound on a typical desktop machine for loading this data. 2 minutes sounds more like a 60 MHz microcontroller reading from a cheap SD card. So either you have an undetected/unmentioned hardware condition, or your implementation of C++ streams is somehow broken or unusable. I suggest establishing a lower bound for this task on your machine by running my sample code.

Nordic Mainframe
  • 28,058
  • 10
  • 66
  • 83
6

If the integers are saved in binary format and you're not concerned with endianness problems, try reading the entire file into memory at once (fread) and casting the pointer to int *.
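
A minimal sketch of that idea (assuming the file holds raw 4-byte ints written on the same platform, and using the file size to work out how many ints to expect):

#include <stdio.h>
#include <stdlib.h>

int main()
{
    FILE *f = fopen("config.dat", "rb");
    if (!f) { perror("fopen"); return 1; }

    // use the file size to work out how many bytes to read
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);

    char *buf = (char*)malloc(size);
    if (fread(buf, 1, size, f) != (size_t)size) { perror("fread"); return 1; }
    fclose(f);

    // reinterpret the raw bytes as ints (same endianness and int size assumed)
    int *keys = (int*)buf;
    size_t count = size / sizeof(int);

    if (count)
        printf("loaded %lu ints, first = %d\n", (unsigned long)count, keys[0]);
    free(buf);
}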

Steven A. Lowe
  • 60,273
  • 18
  • 132
  • 202
  • You can use the file size as a basis for allocating the array. – EvilTeach Aug 05 '10 at 13:39
  • Just out of curiosity, does that mean I can't speed up loading using a pure C++ STL approach? – drumsta Aug 05 '10 at 13:42
  • 7
    You could even `mmap` the file and cast the mmap as `int*`. – kennytm Aug 05 '10 at 13:46
  • 1
    Yep, `mmap` is the only sane answer here. – Nikolai Fetissov Aug 05 '10 at 13:51
  • You should mmap anyway, as you may get benefits provided by the OS if the file is still in the cache. – mvds Aug 05 '10 at 13:52
  • @kriau: you can still use STL, you just shouldn't convert to/from string representations. – jalf Aug 05 '10 at 13:57
  • 1
    The advantages of `mmap` are that it will save copying the data into a user buffer (60MB would probably take a few milliseconds to copy), and that you can modify the data in place without having to explicitly write it back to the file. The disadvantages are that it isn't portable, and you'd need to find or write a RAII wrapper to use it safely. – Mike Seymour Aug 05 '10 at 14:28
  • I was also going to suggest 'use C' ;) – sje397 Aug 05 '10 at 15:30
  • One of the other advantages of `mmap()` is that it will only load the parts of the file that you touch. If you touch only 10 pages of those 60 MB, you'll have 10 I/O's. – MSalters Aug 06 '10 at 08:26
6

You could precompile the array into a .o file, which wouldn't need to be recompiled unless the data changes.

thedata.hpp:

static const int NUM_ENTRIES = 5;
extern int thedata[NUM_ENTRIES];

thedata.cpp:

#include "thedata.hpp"
int thedata[NUM_ENTRIES] = {
10
,200
,3000
,40000
,500000
};

To compile this:

# make thedata.o

Then your main application would look something like:

#include "thedata.hpp"
using namespace std;
int main() {
  for (int i=0; i<NUM_ENTRIES; i++) {
    cout << thedata[i] << endl;
  }
}

Assuming the data doesn't change often, and that you can process the data to create thedata.cpp, this gives effectively instant load time. I don't know if the compiler would choke on such a large literal array, though!
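
Generating thedata.cpp from the existing text file could be as simple as something like this (a rough sketch; the input file name is taken from the question, and thedata.hpp's NUM_ENTRIES would have to match the number of values written):

#include <cstddef>
#include <fstream>
#include <iterator>
#include <vector>

int main()
{
    std::ifstream in("config.dat");
    std::vector<int> keys((std::istream_iterator<int>(in)),
                           std::istream_iterator<int>());

    // emit a compilable source file containing the array initializer
    std::ofstream out("thedata.cpp");
    out << "#include \"thedata.hpp\"\n";
    out << "int thedata[NUM_ENTRIES] = {\n";
    for (std::size_t i = 0; i < keys.size(); ++i)
        out << (i ? "," : " ") << keys[i] << "\n";
    out << "};\n";
}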

Aaron McDaid
  • 26,501
  • 9
  • 66
  • 88
  • Very interesting approach. In my case precomputed data doesn't change at all. Will think about your idea. – drumsta Aug 05 '10 at 14:09
  • 1
    This could cause extreme memory usage when you try to compile the file. I remember ending up with a compiler crash in gcc and MS Visual when using this approach... – tibur Aug 05 '10 at 14:34
  • @tibur: agreed. It's probably better to create an assembler file with the representation of the array. – Nordic Mainframe Aug 05 '10 at 16:41
3

Save the file in a binary format.

Write the file by taking a pointer to the start of your int array and converting it to a char pointer. Then write the 15000000*sizeof(int) chars to the file.

And when you read the file, do the same in reverse: read the file as a sequence of chars, take a pointer to the beginning of the sequence, and convert it to an int*.

Of course, this assumes that endianness isn't an issue.

For actually reading and writing the file, memory mapping is probably the most sensible approach.
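
A rough sketch of the memory-mapped read (POSIX-only; it assumes the binary file was written on the same platform, so the mapped bytes can be reinterpreted as ints directly):

#include <cstddef>
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main()
{
    int fd = open("config.dat", O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    // map the raw bytes and view them directly as the int array
    void *p = mmap(0, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { std::perror("mmap"); return 1; }

    const int *keys = static_cast<const int*>(p);
    std::size_t count = st.st_size / sizeof(int);

    if (count)
        std::printf("first key: %d, count: %lu\n", keys[0], (unsigned long)count);

    munmap(p, st.st_size);
    close(fd);
}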

jalf
  • 243,077
  • 51
  • 345
  • 550
  • That's exactly what I'm thinking about right now. The only thing that isn't so good from an implementation perspective is that I have to rely on std::array internals, i.e. I have to assume std::array stores its data in a C-style array internally. But that's fine with me. – drumsta Aug 05 '10 at 14:16
  • 1
    If you are going to write binary, at least convert it into a binary format that is easy to move between architectures. See htonl() and family http://publib.boulder.ibm.com/infocenter/zvm/v5r4/index.jsp?topic=/com.ibm.zvm.v54.edclv/htonl.htm – Martin York Aug 05 '10 at 15:44
  • @Martin York: I'm assuming that 1. he doesn't need to move the data between architectures, and 2. he needs maximum performance. Calling `htonl` and its friends is going to be needless overhead if you don't need it. But it's a good point if portability is needed. – jalf Aug 05 '10 at 15:52
  • Any application that lasts more than 1.5 years (or is Amdahl's law not relevant any more?) will have to worry about portability (I emphasize worry (it may not change every 1.5 years, but you need to worry that it will)). But then I am old and curmudgeonly and have been forced to re-write code that was easy when you made assumptions. :-) – Martin York Aug 05 '10 at 15:58
  • the overhead of swapping the bytes is likely overwhelmed by the IO. You'd need to profile but I suspect the portability hit is small for the swapping. – Paul Rubel Aug 05 '10 at 16:19
1

If the numbers never change, preprocess the file into a C++ source and compile it into the application.

If the numbers can change, and thus you have to keep them in a separate file that you load on startup, then avoid reading them number by number using C++ IO streams. C++ IO streams are a nice abstraction, but there is too much of it for such a simple task as loading a bunch of numbers fast. In my experience, a huge part of the run time is spent parsing the numbers and another part in accessing the file char by char.

(Assuming your file is more than a single long line.) Read the file line by line using std::getline(), and parse the numbers out of each line not with streams but with std::strtol(). This avoids a huge part of the overhead. You can get even more speed out of the streams by crafting your own variant of std::getline() that reads the input ahead (using istream::read()); the standard std::getline() also reads the input char by char.
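
A minimal sketch of the getline()/strtol() variant (the file name and reserve size are taken from the question):

#include <cstdlib>
#include <fstream>
#include <string>
#include <vector>

int main()
{
    std::ifstream in("config.dat");
    std::vector<int> keys;
    keys.reserve(15000000);

    std::string line;
    while (std::getline(in, line))
    {
        // strtol avoids the per-number overhead of operator>>
        keys.push_back(static_cast<int>(std::strtol(line.c_str(), 0, 10)));
    }
}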

wilx
  • 17,697
  • 6
  • 59
  • 114
0

Use a buffer of 1000 integers (or even 15M; you can adjust this size as you please) instead of reading integer after integer. Not using a buffer is clearly the problem, in my opinion.
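
As a sketch of what that might look like once the data has been re-saved as raw binary ints (as other answers suggest), reading in blocks of 1000 instead of one value at a time:

#include <cstddef>
#include <fstream>
#include <vector>

int main()
{
    std::ifstream in("config.dat", std::ios::binary);
    std::vector<int> keys;
    keys.reserve(15000000);

    const std::size_t BUF = 1000;   // ints per read
    std::vector<int> buf(BUF);
    while (in.read(reinterpret_cast<char*>(&buf[0]), BUF * sizeof(int)) || in.gcount())
    {
        // gcount() tells us how many bytes the last (possibly partial) read produced
        std::size_t got = in.gcount() / sizeof(int);
        keys.insert(keys.end(), buf.begin(), buf.begin() + got);
    }
}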

INS
  • 10,594
  • 7
  • 58
  • 89
  • The streams are already buffered, and there are probably more buffers in the OS. Adding another layer of buffering is unlikely to speed it up. – Mike Seymour Aug 05 '10 at 14:12
0

If the data in the file is binary and you don't have to worry about endianness, and you're on a system that supports it, use the mmap system call. See this article on IBM's website:

High-performance network programming, Part 2: Speed up processing at both the client and server

Also see this SO post:

When should I use mmap for file access?

Robert S. Barnes
  • 39,711
  • 30
  • 131
  • 179