
I have a computationally heavy C++ function that extracts numbers from a file and puts them into a vector. When I call this function from main, it takes a long time. Is it possible to have this function computed once and then linked into the main program, so that I save that computation time on every run?

The function I have is this:

    vector <double> extract (vector <double> foo)
    {
        ifstream wlm;
        wlm.open("wlm.dat");

        if(wlm.is_open())
        {
            while (!wlm.eof())
            {
                //blah blah extraction stuff
            }
            return foo;
        }
        else
            cout<<"File didn't open"<<endl;
        wlm.close();
    }

My main program does other computations as well. I don't want to call this function from the main program because it takes a long time. Instead, I want the vector to be extracted beforehand, at compile time, so I can use the extracted vector later in my main program. Is this possible?

pyroscepter
  • Why are you asking us? Buy yourself a stopwatch and measure. – Kerrek SB Jun 06 '16 at 08:36
  • Data that you read from file is changed from run to run? Or it is static? – Alex Jun 06 '16 at 08:38
  • _"Also, is this function taking a long time during compilation or while being executed?"_ You're the only one who can tell us that. Your question confuses the two, several times, to the point of making it unanswerable. – Lightness Races in Orbit Jun 06 '16 at 08:39
  • I'm fairly sure you want to reduce *run* time, not *compile* time. The latter is the time spent building the program. Btw, you don't need wlm.close(); this is C++, and the stream closes itself when it goes out of scope. – stijn Jun 06 '16 at 08:39
  • _"Is it possible to somehow have this function compile once, and then linked to the main program"_ That's literally how C++ works already. As such, this question appears to be a (broken) XY problem. – Lightness Races in Orbit Jun 06 '16 at 08:39
  • @LightnessRacesinOrbit problem is I don't want to run this function with the main program because the file size is huge (a few gigs). Is there a way I can run this function once, get the vector, then use this vector with the main c++ program? This is what I meant to ask. Sorry for not being specific enough. – pyroscepter Jun 06 '16 at 08:41
  • So you want to _precompute_ the data and ship your program with the computed result, rather than having it compute the result itself? Sure, simply do that. Process the text file itself, separately. – Lightness Races in Orbit Jun 06 '16 at 08:42
  • Why is pulling numbers out of a file and putting them in a vector computationally heavy? That should require almost no computation at all. Why not just fix/optimize your code so it doesn't do unnecessary work? – David Schwartz Jun 06 '16 at 08:44
  • Do you mean run it in a separate process, compute the vector then serialise the vector to disk so that your main program can just read in the vector without recomputing the calculation? Or do you mean run it in the background as your program starts, or run it first in your program? – Rup Jun 06 '16 at 08:44
  • Seems like you want to include a 'resource' of several gb into your program? Well, when starting that program it will be read from disk and put into memory. That's also what your program does now already, so no you cannot make that faster somehow. – stijn Jun 06 '16 at 08:44
  • @pyroscepter, I tried to change some terminology in the question so it better reflects what (I think) you are after. Can you confirm I didn't change what you meant? – Tamás Szelei Jun 06 '16 at 08:44
  • Looking at your read (example), you should definitely read [why is `while(!eof)` always wrong](http://stackoverflow.com/questions/5431941/why-is-while-feof-file-always-wrong) – BeyelerStudios Jun 06 '16 at 08:49
  • @stijn So you are saying it's impossible to do what I suggested at compile time? If it were possible I could just link the two object files together, right? That's what I am after, I think. To compute this at compile time. So I can link it with the main program, saving the compile time each time. – pyroscepter Jun 06 '16 at 08:51
  • @BeyelerStudios Thanks! So I'm extracting into int using wlm >> var; So what should I use instead? – pyroscepter Jun 06 '16 at 08:59
  • @pyroscepter If the limiting factor is the time it takes to read the file, you should look for ways to store the data in the smallest possible file. Storing the numbers in text form is definitely sub-optimal. Also, reading it naively into a vector is horribly inefficient too. Why are you doing that? – David Schwartz Jun 06 '16 at 09:04

2 Answers

  1. Change your function signature to this:

    std::vector<double>& extract(std::vector<double>& foo)

     This avoids copying the vector on the way in and, before C++11, again on the way out (I guess that is what eats most of the time).

  2. Try to reserve() memory for your vector according to the file data, if that is possible. That lets you avoid reallocations.
  3. Always return the std::vector<double>, not only on success. Falling off the end of a non-void function is undefined behavior.
  4. Close the file only if it was opened successfully.

Something like this:

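    std::vector<double>& extract(std::vector<double>& foo)
    {
        std::ifstream wlm;
        wlm.open("wlm.dat");

        if (wlm.is_open())
        {
            // Note: testing the read itself is more robust than testing eof().
            while (!wlm.eof())
            {
                // blah blah extraction stuff
            }
            wlm.close();
        }
        else
            std::cout << "File didn't open" << std::endl;

        return foo;  // returned on both the success and the failure path
    }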
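
As a minimal sketch of the reserve() point above and of the `while(!eof)` issue raised in the comments: the loop below tests the extraction itself, so it stops at end of file or at the first malformed token. The name `read_doubles` and the `expected_count` hint are hypothetical, and the sketch assumes the file holds whitespace-separated numbers; the actual extraction logic was not shown in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <vector>

// Hypothetical helper: reads whitespace-separated doubles from any input stream.
// expected_count is an optional size hint (assumption: the caller can estimate it).
std::vector<double> read_doubles(std::istream& in, std::size_t expected_count = 0)
{
    std::vector<double> values;
    values.reserve(expected_count);  // avoids repeated reallocation if the hint is good

    double value;
    while (in >> value)              // false at end of input or on a malformed token
        values.push_back(value);

    return values;                   // moved or elided on return, not deep-copied
}

// Usage sketch with an in-memory stream; a std::ifstream is passed the same way.
inline void read_doubles_demo()
{
    std::istringstream in("1.5 2.5\n3.5");
    const std::vector<double> v = read_doubles(in, 3);
    assert(v.size() == 3 && v[0] == 1.5);
}
```

Taking a `std::istream&` rather than opening the file inside the function keeps the parsing testable without touching the disk.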
Arkady
  • This is fine, I thought of doing this (and will make appropriate changes) but problem is I don't want to run this function with the main program because the file size is huge (a few gigs). Is there a way I can run this function once, get the vector, then use this vector with the main c++ program? This is what I meant to ask. Sorry for not being specific enough. – pyroscepter Jun 06 '16 at 08:40
  • As in persisting across multiple invocations of the program? No, even if you were to add the data directly to the compiled executable, you'd still have to construct the vector from the data, and this step is likely disk I/O bound instead of CPU bound. – OmnipotentEntity Jun 06 '16 at 08:42
  • @pyroscepter If the program needs a few gigs of data, then it has to get that data from someplace, no? By storing the data in a file, you are computing it only once. If the performance limiting factor is reading the file, then you should be looking to find ways to store the data in less space so that less data has to be read. What's the file format? How compact is it? – David Schwartz Jun 06 '16 at 08:45
  • @pyroscepter you want to load and parse your file at compile time, or you want to load and parse your file at start of program, and then just use result? – Arkady Jun 06 '16 at 08:48
  • @DavidSchwartz .dat file. Oh so you are saying it's impossible to do what I suggested at compile time? If it were possible I could just link the two object files together right? That's what I am after I think. To compute this at compile time. – pyroscepter Jun 06 '16 at 08:50
  • @pyroscepter, you have to understand, that size of such vector will be few gigs, and if you want to have it as a part of your binary (so, any run can just simply use it, because it was constructed and filled at compile time), that will increase your binary to few gigs. – Arkady Jun 06 '16 at 08:51
  • @Arkady Yeah I get it. But that will save all the compile time of converting the file into a vector all the time right? I guess I'm willing to trade. – pyroscepter Jun 06 '16 at 08:54
  • @pyroscepter look here: http://stackoverflow.com/questions/35743797/are-constexpr-functions-that-load-files-possible-in-c Anyway, if you need a big array of doubles, you can create a tool that parses your file into `double bla[] = {numbers .. numbers}`, containing all the numbers. Then you can add it to your binary as a special `*.cpp` file and use it as an already existing array. According to http://en.cppreference.com/w/cpp/language/constexpr it is not possible to write a constexpr function that returns `std::vector` – Arkady Jun 06 '16 at 08:58
  • @pyroscepter Converting the file to a vector should be utterly trivial. If it's not, you're doing something terribly wrong. If the time consuming thing is reading the file, then you should work on making the file as small as possible. What is the file format? (Not what name do you give it. What is the actual binary format in which the numbers are encoded? Does it waste any bytes? Are you sure it's file read time?) – David Schwartz Jun 06 '16 at 09:03
  • @DavidSchwartz I'm not quite sure how to find out the encoding of the file. Properties says it's a plain/text document. It stores only double type numbers. I'm using vectors because I need to do computations on the numbers in the file, and vectors are pretty easy to use. What do you suggest instead? – pyroscepter Jun 06 '16 at 09:26
  • @pyroscepter I suggest two things: 1) Converting the file into a format that takes fewer bytes. 2) Reading the file in a more efficient way. – David Schwartz Jun 06 '16 at 09:28
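
The two suggestions above (a more compact format, read more efficiently) could be sketched as a one-time conversion of the text file into a raw binary cache that later runs load with a single read. This is only an illustration: `save_binary`, `load_binary`, and the cache file name are hypothetical, and the sketch assumes the cache is written and read on the same machine with the same compiler, since raw doubles are not a portable file format.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Writes the vector's raw bytes to path. Returns false if the file cannot be written.
bool save_binary(const std::string& path, const std::vector<double>& values)
{
    std::ofstream out(path, std::ios::binary);
    if (!out) return false;
    out.write(reinterpret_cast<const char*>(values.data()),
              static_cast<std::streamsize>(values.size() * sizeof(double)));
    return static_cast<bool>(out);
}

// Reads the whole cache back with one read call. Returns an empty vector on failure.
std::vector<double> load_binary(const std::string& path)
{
    std::ifstream in(path, std::ios::binary | std::ios::ate);  // open at end to get size
    if (!in) return {};
    const std::streamsize bytes = in.tellg();
    if (bytes < 0) return {};
    in.seekg(0, std::ios::beg);

    std::vector<double> values(static_cast<std::size_t>(bytes) / sizeof(double));
    in.read(reinterpret_cast<char*>(values.data()),
            static_cast<std::streamsize>(values.size() * sizeof(double)));
    if (!in) values.clear();
    return values;
}

// Usage sketch: convert once, then load the cache on every later run.
inline void binary_cache_demo()
{
    const std::vector<double> parsed{1.2, 3.4, 5.6};  // stands in for the parsed text file
    assert(save_binary("wlm_cache_demo.bin", parsed));
    assert(load_binary("wlm_cache_demo.bin") == parsed);
}
```

A binary double takes 8 bytes regardless of its value, while its text form often takes more and must also be parsed, so this addresses both file size and conversion cost.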

While your question was not entirely clear, I assume that you want to:

  • compute a vector of doubles from a large set of data
  • use this computed (smaller) set of data in your program
  • do the computation at compile time

This is possible of course, but you will have to leverage whatever build system you are using. Without more specifics, I can only give a general answer:

  1. Create a helper program that you can invoke during compilation. This program should implement the extract function and dump the result into a file. You have two main choices here: go for a resource file that can be embedded into the executable, or generate source code that contains the data. If the data is not terribly large, I suggest the latter.

  2. Use the generated file in your program

For example:

Pre-build step:

    extract_data.exe extracted_data_generated

This dumps the extracted data into a header and a source file, such as:

    // extracted_data_generated.h
    #pragma once
    #include <array>
    extern const std::array<double, 4> extracted;

    // extracted_data_generated.cpp
    #include "extracted_data_generated.h"
    const std::array<double, 4> extracted{ { 1.2, 3.4, 5.6, 6.7 } }; // etc.

In other parts of your program, use the generated data:

    #include "extracted_data_generated.h"

    // `extracted` is available as a variable here.

I also changed the type to a std::array. Its size must be a compile-time constant, and the helper program can supply it because it knows how many elements the vector holds when it generates the code.

The resource route is similar, but you will have to implement platform-specific extraction of the resource and reading the data. So unless your computed data is very large, I'd suggest the code generation.
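
The code-generation step of the helper program might look roughly like the sketch below. The function names are hypothetical, the output file names follow the example above, and the sketch assumes a non-empty data set that is small enough for the compiler to handle as a source file. Values are written with `max_digits10` significant digits so that they convert back to the same doubles.

```cpp
#include <cassert>
#include <cstddef>
#include <iomanip>
#include <limits>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Emits the contents of extracted_data_generated.h for an array of `count` doubles.
void write_generated_header(std::ostream& out, std::size_t count)
{
    out << "#pragma once\n#include <array>\n"
        << "extern const std::array<double, " << count << "> extracted;\n";
}

// Emits the contents of extracted_data_generated.cpp with the values as an initializer.
void write_generated_source(std::ostream& out, const std::vector<double>& values)
{
    out << "#include \"extracted_data_generated.h\"\n"
        << "const std::array<double, " << values.size() << "> extracted{ {";
    out << std::setprecision(std::numeric_limits<double>::max_digits10);
    for (std::size_t i = 0; i < values.size(); ++i)
        out << (i == 0 ? " " : ", ") << values[i];
    out << " } };\n";
}

// Usage sketch: the real helper would pass std::ofstream objects instead.
inline void generator_demo()
{
    std::ostringstream header;
    write_generated_header(header, 2);
    assert(header.str().find("std::array<double, 2>") != std::string::npos);
}
```

The helper's main function would run the extraction, open the two output files, and call these functions; the build system then compiles the generated source file along with the rest of the program.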

Tamás Szelei
  • Thanks! This was exactly what I was looking for. At present it takes around a minute to execute this function, and will only increase when I start working with larger files. Your answer really helps! I'll update if/when I can get it working or if I have a problem. What should I change my question wording to, so it's clearer for others who see the question? – pyroscepter Jun 06 '16 at 09:22
  • @pyroscepter I'm glad it helped. I think if you incorporate my assumptions into the question and remove the last paragraph, it will be much more clear. – Tamás Szelei Jun 06 '16 at 09:26