Sip or gulp files?

Question

I have a series of text files that I want to read as the input to a finite state machine. As such, only one character of that file is needed at a time. Now as I understand it, disk access is a time-expensive operation, leading to this question:

Is it faster to load the entire (small to moderately sized) file into memory before using the data, or is `ifstream` optimized enough that repeatedly dipping into the hard disk won't prove to be a performance hit?

`ifstream` provides a buffered read. – πάντα ῥεῖ, Nov 03 '15 at 10:15

score 0 · Answer 1 · edited May 23 '17 at 11:51

As long as you have enough RAM for it, reading the entire file at once is usually faster. `ifstream` is buffered, but (1) only you can know the best size for the buffer and (2) it carries some overhead of its own (compare with e.g. this).
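To make the two options concrete, here is a minimal sketch, assuming a toy state machine that only counts whitespace-separated words. `WordCounter`, `countWordsGulp` and `countWordsSip` are hypothetical names made up for illustration; they are not from the question. Which variant is actually faster on your files is something only a measurement can tell you.

```cpp
#include <cctype>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Toy finite state machine (hypothetical): counts whitespace-separated words.
struct WordCounter {
    bool inWord = false;
    std::size_t words = 0;
    void feed(char c) {
        const bool space = std::isspace(static_cast<unsigned char>(c)) != 0;
        if (!space && !inWord) ++words;  // transition: outside -> inside a word
        inWord = !space;
    }
};

// "Gulp": read the whole file into memory first, then feed the machine.
std::size_t countWordsGulp(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    const std::string data((std::istreambuf_iterator<char>(in)),
                           std::istreambuf_iterator<char>());
    WordCounter fsm;
    for (char c : data) fsm.feed(c);
    return fsm.words;
}

// "Sip": pull one character at a time. The stream's internal buffer means
// most get() calls are served from memory rather than from the disk.
std::size_t countWordsSip(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    WordCounter fsm;
    char c;
    while (in.get(c)) fsm.feed(c);
    return fsm.words;
}
```

Both functions produce the same count for the same file; they differ only in where the buffering happens.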

answered Nov 03 '15 at 10:26

matb

Francis Cugler · Accepted Answer · 2015-11-12T08:14:40.353

If you know the file is small, you can read it all in one go; if it is large, it may be better to read it a line at a time. How you store what you have read depends on the data types involved and on whether you open the file as text or as binary. If you open it as binary, you can store the contents in an `unsigned char*` buffer, but you have to know how many bytes your data structure occupies, which depends on the data types you expect. If you are reading a text file, you can save the contents to a string, but you will then need to parse that string accordingly. Opening and closing a file tends to be slower than working in memory. When you access files on a hard drive you are going over the bus, which tends to be slower, and you may also have cache misses. Working in RAM is usually faster.

An easy way to picture this is to compare rendering 3D objects to the screen on the CPU with sending batches of data to the GPU. Think of a 3D graphics engine that stores batches of vertex information, where each vertex may contain the following: an (x,y,z) world position, an (x,y) screen pixel position with (r,g,b,a) color information, normal and tangent vectors for light processing, and (s,t) or (u,v) texture coordinates. So a simple triangle may look like this:

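// These Structs Are Laid Out In Columns To Preserve Page Length; Each Column Is A Separate Struct
// Note: Anonymous Structs Inside A Union Are A Compiler Extension (GCC, Clang, MSVC), Not Standard C++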
//  8 Bytes                  16 Bytes               8 Bytes
struct Vec2f {           struct Vec2d {          struct Vec2i {
    union {                  union {                 union {
         float f2[2];            double d2[2];           int i2[2];
         // Positional
         struct {                struct {                struct {
             float x;                double x;               int x;
             float y;                double y;               int y;
         };                      };                      };
         // Texture Coords - Some Use S&T Others Use U&V
         struct {                struct {                struct {
             float s;                double s;               int s;
             float t;                double t;               int t;
         };                      };                      };
         /*struct {                struct {                struct {
             float u;                double u;               int u;
             float v;                double v;               int v;
         }; // Only Shown Here   };                      };*/
         // Color Values.
         struct {                struct {                struct {
             float r;                double r;               int r;
             float g;                double g;               int g;
         };                      };                      };
    };                       };                      };
};                       };                      };

//  12 Bytes                24 Bytes               12 Bytes
struct Vec3f {           struct Vec3d {          struct Vec3i {
    union {                  union {                 union {
         float f3[3];            double d3[3];           int i3[3];
         struct {                struct {                struct {
             float x;                double x;               int x;
             float y;                double y;               int y;
             float z;                double z;               int z;
         };                      };                      };
         struct {                struct {                struct {
             float s;                double s;               int s;
             float t;                double t;               int t;
             float p;                double p;               int p;
         };                      };                      };
         struct {                struct {                struct {
             float r;                double r;               int r;
             float g;                double g;               int g;
             float b;                double b;               int b;
         };                      };                      };
    };                       };                      };
};                       };                      };

//  16 Bytes                  32 Bytes               16 Bytes
struct Vec4f {           struct Vec4d {          struct Vec4i {
    union {                  union {                 union {
         float f4[4];            double d4[4];           int i4[4];
         struct {                struct {                struct {
             float x;                double x;               int x;
             float y;                double y;               int y;
             float z;                double z;               int z;
             float w;                double w;               int w;
         };                      };                      };
         struct {                struct {                struct {
             float s;                double s;               int s;
             float t;                double t;               int t;
             float p;                double p;               int p;
             float q;                double q;               int q;
         };                      };                      };
         struct {                struct {                struct {
             float r;                double r;               int r;
             float g;                double g;               int g;
             float b;                double b;               int b;
             float a;                double a;               int a;
         };                      };                      };
    };                       };                      };
};                       };                      };

// Not All Triangles Sent To Be Rendered To The Screen Will Have All Of This
// Information, But There Is A Good Chance That Many Will.
struct TriangleMeshF {
    // If We Do The Math For This One Triangle Mesh
    // The Float Vecs Are 8, 12 & 16 Bytes Each
    Vec3f v3Vertices[3]; // 3 Vertexes That Make This Triangle
    Vec4f v4Color[3];    // 3 Colors 1 For Each Vertex
    Vec2f v2TexCoord[3]; // 3 Texture Coords 1 For Each Vertex
    Vec3f v3Normal[3];   // 3 Normal Vectors 1 For Each Vertex
    Vec3f v3BiNormal[3]; 
    Vec3f v3Tangent[3];
    Vec3f v3BiTangent[3];
    // 12*3*5 + 16*3 + 8*3 = 252 Bytes  Or There are 63 4Byte Floats
}; 

// 252 Bytes Doesn't Seem Like A Large Amount Of Data, But What Happens When
// We Have A Model File That Fills In A vector<TriangleMeshF>?
// Let's Say That This One Model File Fills This vector<T> With 80K TriangleMeshes.
// Processing That Much Data On The CPU Is Slower Than On The GPU.
// Also, Rendering Calls From The CPU To The GPU Can Be Slow
// Because They Travel Over The Bus. This Is Why 3D Graphics Engine Developers
// Who Use Either DirectX Or OpenGL Or Both Use Batches To Collect Buckets
// Of Different Types Of Rendering Meshes, Then Send Them To The Graphics
// Card When The Buckets Become Full. This Is Kind Of How The Process
// Or Concept Of Programmable Shaders Came Into Play.

// I Only Used Vec2f, 3f, & 4f Here, But Imagine If The User Had Used Vec4d
// Then The Mesh Would Double In Size And This Would Get Large Very Fast.
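The byte counts in the comments above can be checked by the compiler. The following is only a sketch under stated assumptions: it uses plain stand-in structs (no unions) so that it stays within standard C++, and it assumes a 4-byte `float`. `kTriangleCount` and `kTotalBytes` are hypothetical names introduced here.

```cpp
#include <cstddef>

// Standard C++ stand-ins for the float vectors above (plain members only,
// because anonymous structs inside a union are a compiler extension).
struct Vec2f { float x, y; };
struct Vec3f { float x, y, z; };
struct Vec4f { float x, y, z, w; };

struct TriangleMeshF {
    Vec3f v3Vertices[3];
    Vec4f v4Color[3];
    Vec2f v2TexCoord[3];
    Vec3f v3Normal[3];
    Vec3f v3BiNormal[3];
    Vec3f v3Tangent[3];
    Vec3f v3BiTangent[3];
};

// Assumption: float is 4 bytes, which holds on mainstream desktop platforms.
static_assert(sizeof(float) == 4, "sketch assumes 4-byte floats");
static_assert(sizeof(TriangleMeshF) == 252, "12*3*5 + 16*3 + 8*3 = 252");

// 80,000 triangles at 252 bytes each.
constexpr std::size_t kTriangleCount = 80000;
constexpr std::size_t kTotalBytes = kTriangleCount * sizeof(TriangleMeshF);
static_assert(kTotalBytes == 20160000, "roughly 20 MB of vertex data");
```

So the 80K-triangle model from the comments comes to roughly 20 MB with floats, and about twice that if every member were a double.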

This may seem a bit off topic, but the concept still applies. When the processor communicates with the video card, sound card, hard drive, optical drive, network and so on over the bus, it is a slower process than working directly from RAM, just as the conventional system RAM used by the CPU is slower than the GPU's own RAM.

So it all depends on your data types, on whether you read the files as text or as binary, and on how large the files are. As I stated before, the two most common methods are reading the file into a buffer all in one go, or reading its contents line by line, and saving the result into an unsigned char array, a char array, or a string. If the file is all pure numbers, the result could be an array of ints. This will not work for floats or doubles, since you would have to parse the text for those.
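For the binary, all-in-one-go case, a sketch might look like the following. It assumes the file fits in memory and uses a `std::vector<unsigned char>` in place of a raw `unsigned char*` so the buffer cleans up after itself. `readAllBytes` is a hypothetical helper name, not something from the standard library.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Reads a whole file in binary mode into a byte buffer.
// Returns an empty vector if the file cannot be opened or is empty.
std::vector<unsigned char> readAllBytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);  // open at end
    if (!in) return {};
    const std::streamoff size = in.tellg();  // position at end == byte count
    if (size <= 0) return {};
    std::vector<unsigned char> bytes(static_cast<std::size_t>(size));
    in.seekg(0, std::ios::beg);
    in.read(reinterpret_cast<char*>(bytes.data()),
            static_cast<std::streamsize>(size));
    bytes.resize(static_cast<std::size_t>(in.gcount()));  // keep what was read
    return bytes;
}
```

Interpreting those bytes as ints, floats or anything else is still up to you, which is the "you have to know how many bytes your data structure contains" part.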

As you have stated, you are working with a txt file, so more than likely the best option is this: if the file is smaller than a size limit that you have set, it is easy to open the file and read it line by line. If it is larger than this but not too large, you can read it all in one go. If it is too large for the amount of RAM available, or too large to fit in a single buffer, you have a few options. You could set a buffer size, say 4MB or 2GB, and count how much you have read as you fill the buffer; once the maximum buffer size is reached, store that first buffer, create a new one, and continue reading, repeating this until EOF is reached. Reading line by line may also work, and it is a little nicer to work with afterwards: it makes the text easier to parse, as opposed to having to separate a buffer into lines of text.
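The fixed-size buffer idea can be sketched roughly as below. Note one deliberate difference from the description above: instead of storing each full buffer and allocating a new one, this sketch reuses a single buffer and hands every chunk to a callback, which suits a state machine that needs each character only once. `forEachChunk` and its parameters are hypothetical names.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Reads a file in fixed-size chunks and passes each chunk to onChunk.
// Returns the total number of bytes read.
template <typename Callback>
std::size_t forEachChunk(const std::string& path, std::size_t chunkSize,
                         Callback onChunk) {
    std::ifstream in(path, std::ios::binary);
    std::vector<char> buffer(chunkSize);
    std::size_t total = 0;
    while (in && chunkSize > 0) {
        in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        const std::size_t got = static_cast<std::size_t>(in.gcount());
        if (got == 0) break;          // nothing more to read: EOF reached
        onChunk(buffer.data(), got);  // the last chunk may be shorter
        total += got;
    }
    return total;
}
```

A call such as `forEachChunk(path, 4 * 1024 * 1024, handler)` would correspond to the 4MB buffer mentioned above, with `handler` being whatever function feeds your parser.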

However, to do this you would need two or three different parsing mechanisms in place: one for reading everything at once, another for reading line by line, plus logic to determine when to use which. It may only require a few extra functions or methods. For the whole-file read, you may want to scan the buffer for the newline character '\n', save each line into a string, and push that string into a vector, then pass this vector off to your parser to look for keywords or tokens. With the line-by-line read, each line can be saved directly into your vector container. To determine the file size limit at which to switch between readAll() and readLine(), you may have to run a benchmark speed test. I hope this helps you in your decision making.
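The two reading paths can end up feeding the same parser, as in this minimal sketch. It assumes plain text files with '\n' line endings; `readAllThenSplit` and `readLineByLine` are hypothetical stand-ins for the readAll() and readLine() mentioned above.

```cpp
#include <fstream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Gulp the file into one string, then split that buffer on '\n'.
std::vector<std::string> readAllThenSplit(const std::string& path) {
    std::ifstream in(path);
    const std::string all((std::istreambuf_iterator<char>(in)),
                          std::istreambuf_iterator<char>());
    std::vector<std::string> lines;
    std::istringstream stream(all);
    for (std::string line; std::getline(stream, line);) lines.push_back(line);
    return lines;
}

// Read line by line straight into the vector.
std::vector<std::string> readLineByLine(const std::string& path) {
    std::ifstream in(path);
    std::vector<std::string> lines;
    for (std::string line; std::getline(in, line);) lines.push_back(line);
    return lines;
}
```

Since both return the same `std::vector<std::string>`, the parser does not need to know which path was taken, and timing the two on your own files is one way to pick the switch-over size.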

I made a simple edit to this; I had noticed that in the Vec4 structures the last parameter for positional data was a `z` and should have been a `w`. I made the appropriate corrections. – Francis Cugler, Nov 12 '15 at 08:16
