
I have to solve a problem and I need your help, so thanks in advance. As you can see, the title says almost everything. There is a text file "domes.in" which contains 10.000.000 pairs of integers in the range 1 - 100.000. In the first part, I have to read and store all those numbers in less than 1 second.

First, I tried creating an int array of 20.000.000 elements, so that I could save the numbers one by one. The array is so big that the program does not respond.

Before you read the code, read the task. But please just answer my question and do not post any code, because it is a problem from an online competition and I would like to write it by myself.

#include <fstream>
using namespace std;

int main()
{
    int n, m, a, b;        //N(Value)    M(Pairs)
    int t = 0;
    int r = 0;
    int j;

    ifstream infile("domes.in");
    infile >> n >> m;     //N = 100.000    M = 10.000.000
    int domes[m*2];

    for (j=0; j<m; j++)      //For 1 to 10.000.000
    {
        infile >> a >> b;    //Save 10.000.000 Integers
        domes[t++] = a;
        domes[t++] = b;
    }
    for (j=1; j<=n; j++)          //For J = 1 - 100.000
    {
        int i=0;
    for (int k=0; k<t; k++)   //Count how many times point J appears
        if (domes[k] == j)    //+1 Link
                i++;

        if (i == 1)               //If Links < 2  (of number J)
            r++;                  //+1 Point is connected with less than 2 points
    }                             //Else move on to ++J
    infile.close();

    ofstream outfile("domes.out");
    outfile << r;
    return 0;
}

Now it seems that it might work, but again the program doesn't respond. (At runtime it raises an error which says that the program stopped working. There are no other build errors or mistakes in the code.)

What am I doing wrong?

The text file:

100000 10000000
2344 3444
3345 4564
5566 9455   //Random integers up to 100.000
//..........

//Edit: I removed the struct "DOMES" from the code, since it was the same thing

The task (I will explain it in my own words, because it is complicated): There are 100.000 points on the map. These points are connected by 10.000.000 links. I have to find the points which have only a single connection (each point has at least 1 connection).

For example:

5 4   (N = 5, M = 4)   N(Max Value)   M(Pairs)
1 2
2 3
4 5
3 4

There are 2 points that have a single connection: 1 and 5 (1 is connected only with 2, and 5 only with 4). The rest of the points are connected at least twice:

2: 1-2, 2-3

3: 2-3, 3-4

4: 4-3, 4-5

koxliz koxliz
  • You won't be able to load this amount of data directly, unless using x64 application. What is the task you need to do with this data? Can you go with stream reading? (i.e. reading and processing the file chunk-by-chunk) – DarkWanderer Dec 27 '13 at 13:52
  • When you say "up to 100.000 digits", do you mean each integer can be that large? If so, a C `int` isn't going to be big enough to hold one. – Scott Hunter Dec 27 '13 at 13:57
  • 3
    The numbers can have up to 100,000 digits? If so, they won't fit in an `int`, or any other built-in integral type. – James Kanze Dec 27 '13 at 13:57
  • Also: 10 million `DOMES` will take exactly the same space as 20 million `int`. And while the space isn't overwhelming (supposing the numbers really do fit in an `int`), it's probably too much for the stack. Try using `std::vector`. – James Kanze Dec 27 '13 at 14:00
  • Actually, you are right. I didn't mean 100.000 digits; I wanted to say that the integer (the number) can be up to 100.000 – koxliz koxliz Dec 27 '13 at 14:00
  • What does the text file look like? The way you are reading `ints` it will stop at all the periods. – Jesse Good Dec 27 '13 at 14:00
  • And while I'm at it: `infile >> domes[t].a >> domes[t++].b` is undefined behavior. You're not allowed to modify a variable, and use it elsewhere in the expression. – James Kanze Dec 27 '13 at 14:01
  • @JesseGood That actually depends on the locale. (But it is true in the "C" locale, and since he hasn't changed the locale, that's what he's got.) – James Kanze Dec 27 '13 at 14:02
  • What about the fact that I have to save those numbers in less than 1 sec? The program does not even respond, not to mention that when I try to open this text file manually, Windows doesn't respond either... – koxliz koxliz Dec 27 '13 at 14:05
  • What does the original task look like? Maybe you don't need to load all the numbers into memory? – Dmitry Markin Dec 27 '13 at 14:08
  • 1
    10M is not such a horribly large number by any means. You should avoid using the stack for that amount of data, but that is rougly 160M of data (assumes 4bytes per `int`, total 20M of them). – David Rodríguez - dribeas Dec 27 '13 at 14:15
  • Do you really need the whole file at once? What actual error do you get when you run this? – doctorlove Dec 27 '13 at 14:18
  • Why one second? Where did this requirement come from? – Lightness Races in Orbit Dec 27 '13 at 14:24
  • Yes, I need the whole file at once... I have already explained what the error is; look at the description: the program stopped working = does not respond – koxliz koxliz Dec 27 '13 at 14:25
  • Stupid competition rules... The program works fine if you test it with 20-30 pairs and is done in less than 0.060 seconds (tested). But 10.000.000 pairs is way too much to be done in 1 sec – koxliz koxliz Dec 27 '13 at 14:28
  • @koxlizkoxliz: The requirements don't mention how you solve the problem, only that it needs solving. It is also not clear what you are looking for: those points with no connections or a single connection? Those with more than 2 connections? Are connections directional? Does the direction matter? I have the feeling that a single array with 100k entries and a linear algorithm is more than sufficient for your task, while you are attempting to keep 20M numbers in memory and process them with a quadratic algorithm – David Rodríguez - dribeas Dec 27 '13 at 14:28
  • Yeah, look at the task again, it is now clear – koxliz koxliz Dec 27 '13 at 14:34

3 Answers


The first question is where the cost of your algorithm lies, and it is most probably not in the reading of the input. If I understand your algorithm correctly, you read all of the numbers (node ids) into memory, then for each node you scan the whole list of links to count how many of them include that node. That algorithm is O(N*M) (nodes times links), which for large inputs becomes a huge number of operations.

If what you need to know is how many of the nodes have a single connection, all you need to keep in memory is one entry per node (100k). How much information per node? Very little: to detect which nodes have a single connection, you can store the number of times each node has been seen (0 = never seen). Read one line at a time and increment the counts for both endpoints (cost: O(M), with M being the number of links). At the end, scan the entries and count those whose value is 1 (cost: O(N), where N is the number of nodes). The total cost of the algorithm is linear in the number of nodes plus the number of links, with space linear in the number of nodes.
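To make the counting idea concrete, here is a minimal sketch of the degree-counting approach (the function name `countSingles` and the vector-of-pairs input are my own choices for illustration, not part of the original answer):

```cpp
#include <utility>
#include <vector>

// Count how many node ids in 1..n appear in exactly one link.
int countSingles(int n, const std::vector<std::pair<int, int>>& links) {
    std::vector<int> degree(n + 1, 0);   // heap-allocated: one counter per node
    for (const auto& [a, b] : links) {   // O(M): one pass over the links
        ++degree[a];
        ++degree[b];
    }
    int r = 0;
    for (int j = 1; j <= n; ++j)         // O(N): scan the counters once
        if (degree[j] == 1)
            ++r;
    return r;
}
```

On the question's own example (N = 5 with links 1-2, 2-3, 4-5, 3-4) this counts nodes 1 and 5, giving 2.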

Compare that with the original approach: space ~400 KB vs. ~80 MB; operations ~10M vs. ~2*10^12.

In most programming contests the goal is not to shave the cost of the operations, but to improve the asymptotic complexity of the solution. Naive solutions that seem to work for small inputs won't work for large ones. O(N^2) for a problem of size 10 is basically 100 operations... almost a constant factor! But as the problem increases, with size 100 it becomes 10k, with size 1000 it is 1M...

David Rodríguez - dribeas

When it comes to I/O of large amounts of data, C++ streams are slower than their C counterparts. I am sure you can speed things up a lot with fopen() and fread()/fwrite() from <cstdio>. That way you could transfer the whole array in one go, and you probably wouldn't have the crash issue.
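As a sketch of what this can look like on the reading side: assume the whole file has already been pulled into a character buffer with a single fread() call, then parse the integers by hand instead of with operator>> (the helper name `parseInts` is mine):

```cpp
#include <cctype>
#include <cstddef>
#include <vector>

// Parse all non-negative integers out of a character buffer by hand,
// skipping any non-digit separators (spaces, newlines). This avoids the
// per-extraction overhead of istream's operator>>.
std::vector<int> parseInts(const char* buf, std::size_t len) {
    std::vector<int> out;
    std::size_t i = 0;
    while (i < len) {
        while (i < len && !isdigit((unsigned char)buf[i])) ++i;  // skip separators
        if (i >= len) break;
        int v = 0;
        while (i < len && isdigit((unsigned char)buf[i]))        // accumulate digits
            v = v * 10 + (buf[i++] - '0');
        out.push_back(v);
    }
    return out;
}
```

In real use the buffer would come from one fopen()/fread() pair over the whole file; the parsing loop above is where most of the speedup over streams comes from.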

Sceptical Jule

Not sure whether this will be useful for an online competition. However, for fast access to large files I would try memory-mapped files. (Use mmap() if your platform supports POSIX, or CreateFileMapping and MapViewOfFile on Windows.) Basically, you then read and write the file as you would read and write memory.
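A minimal POSIX sketch of the idea (the helper name `mapFile` is mine, and error handling is reduced to returning nullptr):

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>

// Map a whole file read-only into memory (POSIX). Returns nullptr on error.
// The caller reads the bytes like a plain array and munmap()s when done.
const char* mapFile(const char* path, std::size_t* lenOut) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }   // need the size
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                          // the mapping stays valid after close
    if (p == MAP_FAILED) return nullptr;
    *lenOut = (std::size_t)st.st_size;
    return (const char*)p;
}
```

After mapping, the parsing code walks the returned pointer directly; the OS pages the file in on demand instead of the program issuing explicit reads.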

rakeshdn
  • I ran some tests in the past to decode a number of `int`, and `mmap` surprisingly was not the fastest approach (`fopen` reading chunks of 4K was faster for some reason I cannot really understand) – David Rodríguez - dribeas Dec 27 '13 at 17:53
  • Thanks for sharing your observations. I found this post on the subject, which looks interesting: http://stackoverflow.com/questions/45972/mmap-vs-reading-blocks – rakeshdn Dec 27 '13 at 18:02