
I have to read a text file into an array of structures. I have already written a program, but it is taking too much time as there are about 13 lac structures in the file. Please suggest the best possible and fastest way to do this in C++.

here is my code:

std::ifstream input_counter("D:\\cont.txt");
std::string line; // declared once; getline() replaces its contents on every iteration

/**********************************************************/
int counter = 0;
while( getline(input_counter,line) )
{
    ReadCont( line,&contract[counter]); // function to read data to structure
    counter++;
}
input_counter.close();
SingerOfTheFall
Rsvay
    No simple answer to that. Need more information, like what the structure is, how they were written in the first place, what OS you are using etc. and perhaps most importantly how you are currently reading them. – john Sep 03 '13 at 05:35
  • What does '13 lac' mean? – john Sep 03 '13 at 05:36
  • @john: we are using Qt and it's platform independent. – Rsvay Sep 03 '13 at 05:45
  • @user26117519 OK, so you want a platform independent solution, that's one more piece of information. – john Sep 03 '13 at 05:47
  • I would expect it to go faster with C style I/O into a char array (assuming you can put a maximum size on any line). Even if you can't do that C I/O is faster. – john Sep 03 '13 at 05:52
  • Python (not a speed monster) on my PC (not a speed monster) reading and splitting such a file into rows/cols (21 MB total, 130000 rows with 80 fields) takes about 0.5 sec. You should first investigate where the time is really lost. http://i.qkme.me/3vny7g.jpg – 6502 Sep 03 '13 at 16:23
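
As 6502's comment suggests, the first step is to measure where the time actually goes. A minimal sketch that times the read and parse phases separately with std::chrono (the path is the one from the question):

#include <chrono>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main() {
    using Clock = std::chrono::steady_clock;

    auto t0 = Clock::now();
    std::ifstream in("D:\\cont.txt");
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line))
        lines.push_back(line);
    auto t1 = Clock::now();

    // ... parse `lines` into structures here, as the question's loop does ...
    auto t2 = Clock::now();

    using std::chrono::duration_cast;
    using std::chrono::milliseconds;
    std::cout << "read:  " << duration_cast<milliseconds>(t1 - t0).count() << " ms\n"
              << "parse: " << duration_cast<milliseconds>(t2 - t1).count() << " ms\n";
}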

4 Answers


I would use Qt entirely in this case.

struct MyStruct {
    int Col1;
    int Col2;
    int Col3;
    int Col4;
    // blabla ...
};

QByteArray Data;
QFile f("D:\\cont.txt");
if (f.open(QIODevice::ReadOnly)) {
    Data = f.readAll();
    f.close();
}

MyStruct* DataPointer = reinterpret_cast<MyStruct*>(Data.data());
// Accessing data
DataPointer[0] = ...
DataPointer[1] = ...

Now you have your data and you can access it as an array.

In case your data is not binary and has to be parsed first, you will need a conversion routine. For example, if you read a CSV file with 4 columns:

QVector<MyStruct> MyArray;
QString StringData(Data);
QStringList Lines = StringData.split("\n"); // or whatever the newline character is
for (int i = 0; i < Lines.count(); i++) {
    QString Line = Lines.at(i);
    QStringList Parts = Line.split("\t"); // or whatever the separator character is
    if (Parts.count() >= 4) {
        MyStruct t;
        t.Col1 = Parts.at(0).toInt();
        t.Col2 = Parts.at(1).toInt();
        t.Col3 = Parts.at(2).toInt();
        t.Col4 = Parts.at(3).toInt();
        MyArray.append(t);
    } else { 
        // Malformed input, do something
    }
}

Now your data is parsed and stored in the MyArray vector.
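
One cheap improvement to the loop above, since the row count is known after the split: reserving the vector's capacity up front avoids repeated reallocation while appending (QVector::reserve is stock Qt API):

QVector<MyStruct> MyArray;
MyArray.reserve(Lines.count()); // one allocation up front instead of growing during append()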

bkausbk
  • @user231502 f.readAll() should read the entire data as fast as possible. Of course, if the file is too large, i.e. cannot be read into memory, you will get an exception. try/catch is your friend here. – bkausbk Sep 03 '13 at 06:29
  • our file contains data in this format: |PE|1|0|0|0|0|1|1||2|0||2|0||3|0|. There are around 80 fields in each line, and around 130000 lines in the whole text file. – Rsvay Sep 03 '13 at 06:32
  • The OP needs to load a large text file, which may block the thread for a long time. Could the processing be done in a separate thread, if the OP needs it that way? – Ashif Sep 03 '13 at 06:33
  • our main concern is the time taken in parsing; otherwise our current code is working fine. – Rsvay Sep 03 '13 at 06:38
  • Reading can be done in a worker thread, of course. Concerning parsing ... just replace the \t separator with your own | character. – bkausbk Sep 03 '13 at 06:52
  • yes we can do that, but the whole program is dependent on this file. We need to load this file at the beginning. – Rsvay Sep 03 '13 at 06:54
  • It is easy ... you have to wait then. A splash screen is a good solution here. The only thing you can do is read the file as fast as possible, like I showed above. – bkausbk Sep 03 '13 at 07:03

keep your 'parsing' as simple as possible: where you know the fields' format, apply that knowledge. For instance,

ReadCont("|PE|1|0|0|0|0|1|1||2|0||2|0||3|0|....", ...)

should apply fast char-to-integer conversion, something like

void ReadCont(const char *line, Contract &c) {
   if (line[0] == '|' && line[1] == 'P' && line[2] == 'E' && line[3] == '|') {
     line += 4;
     for (int field = 0; field < K_FIELDS_PE; ++field) {
       c.int_field[field] = *line++ - '0'; // one ASCII digit to int, no strtol overhead
       assert(*line == '|');
       ++line;
     }
   }
}

well, beware of the details, but you get the idea...
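
For completeness, a hypothetical Contract and call site that the sketch above assumes (K_FIELDS_PE and int_field are placeholder names from the sketch, not from the question, and the sample line also contains empty fields (||) that this fast path does not handle):

#include <cassert>
#include <fstream>
#include <string>
#include <vector>

const int K_FIELDS_PE = 7;      // hypothetical: however many single-digit fields follow |PE|

struct Contract {
    int int_field[K_FIELDS_PE]; // one slot per parsed field
};

// ... ReadCont from above goes here ...

int main() {
    std::ifstream in("D:\\cont.txt");
    std::vector<Contract> contracts;
    std::string line;
    while (std::getline(in, line)) {
        Contract c;
        ReadCont(line.c_str(), c);
        contracts.push_back(c);
    }
}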

CapelliC

As user2617519 says, this can be made faster by multithreading. You are currently reading each line and parsing it in one thread; instead, put the lines in a queue and let several threads pop them off and parse the data into structures. A rough sketch of that producer/consumer arrangement follows.
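
A minimal sketch with standard C++11 threads (Contract and ReadCont stand in for the question's names; the rest is illustrative scaffolding, not a drop-in solution):

#include <condition_variable>
#include <functional>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

struct Contract { /* fields as in the question */ };
void ReadCont(const std::string& line, Contract* c) { /* the question's parsing code */ }

std::queue<std::string> lineQueue;
std::mutex queueMutex;
std::condition_variable queueCv;
bool readingDone = false;

// producer: reads lines and pushes them onto the queue
void readerThread() {
    std::ifstream in("D:\\cont.txt");
    std::string line;
    while (std::getline(in, line)) {
        { std::lock_guard<std::mutex> lk(queueMutex); lineQueue.push(std::move(line)); }
        queueCv.notify_one();
    }
    { std::lock_guard<std::mutex> lk(queueMutex); readingDone = true; }
    queueCv.notify_all();
}

// consumer: pops lines and parses into its own vector; merge the vectors after joining
void parserThread(std::vector<Contract>& out) {
    for (;;) {
        std::unique_lock<std::mutex> lk(queueMutex);
        queueCv.wait(lk, [] { return !lineQueue.empty() || readingDone; });
        if (lineQueue.empty()) return;     // reader finished and queue drained
        std::string line = std::move(lineQueue.front());
        lineQueue.pop();
        lk.unlock();                       // parse outside the lock
        Contract c;
        ReadCont(line, &c);
        out.push_back(c);
    }
}

int main() {
    std::vector<Contract> partA, partB;
    std::thread r(readerThread);
    std::thread p1(parserThread, std::ref(partA));
    std::thread p2(parserThread, std::ref(partB));
    r.join(); p1.join(); p2.join();
    // merge partA and partB here
}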
An easier way to do this (without the complication of multithreading) is to split the input data file into multiple files and run an equal number of processes to parse them. The data can then be merged later.

Testing

QFile::readAll() may cause a memory problem and std::getline() is slow (as is ::fgets()).

I faced a similar problem where I needed to parse very large delimited text files in a QTableView. Using a custom model, I parsed the file once to find the offset of the start of each line. Then, when data needs to be displayed in the table, I read and parse the line on demand. This results in a lot of parsing, but it is fast enough that there is no noticeable lag in scrolling or update speed.

It also has the added benefit of low memory usage, since I do not read the file contents into memory. With this strategy, a file of nearly any size can be handled.

Parsing code:

m_fp = ::fopen(path.c_str(), "rb"); // open in binary mode for faster parsing
if (m_fp != NULL)
{
  // read the file to get the row pointers
  char buf[BUF_SIZE+1]; // BUF_SIZE: implementation-defined chunk size, e.g. 64 KB

  long pos = 0;
  m_data.push_back(RowData(pos));
  int nr = 0;
  while ((nr = ::fread(buf, 1, BUF_SIZE, m_fp)))
  {
    buf[nr] = 0; // null-terminate the last line of data
    // find new lines in the buffer
    char *c = buf;
    while ((c = ::strchr(c, '\n')) != NULL)
    {
      m_data.push_back(RowData(pos + c-buf+1));
      c++;
    }
    pos += nr;
  }

  // squeeze any extra memory not needed in the collection
  m_data.squeeze();
}

RowData and m_data are specific to my implementation, but they are simply used to cache information about a row in the file (such as the file position and number of columns).
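
For illustration only, a plausible shape for that cache (a guess at the idea, not the author's actual code):

#include <QVector>

struct RowData {
    explicit RowData(long offset) : offset(offset) {}
    long offset;          // byte position of the start of the row in the file
    // plus whatever else is worth caching: column count, parsed fields, ...
};

QVector<RowData> m_data;  // one entry per row; squeeze() trims unused capacity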

The other performance strategy I employed was to use QByteArray to parse each line instead of QString. Unless you need Unicode data, this will save time and memory:

// optimized line reading procedure
QByteArray str;
char buf[BUF_SIZE+1];
::fseek(m_fp, rd.offset, SEEK_SET); // rd: the cached RowData entry for the requested row
int nr = 0;
while ((nr = ::fread(buf, 1, BUF_SIZE, m_fp)))
{
  buf[nr] = 0; // null-terminate the string
  // find new lines in the buffer
  char *c = ::strchr(buf, '\n');
  if (c != NULL)
  {
    *c = 0;
    str += buf;
    break;
  }
  str += buf;
}

return str.split(',');

If you need to split each line on a set of delimiter characters rather than a single one, ::strtok() can do it; note that it modifies the buffer in place.
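
A small sketch of that usage (the |-separated format is the one from the question's comments; the buffer must be writable because strtok() overwrites delimiters with '\0'):

#include <cstdio>
#include <cstring>

int main() {
    char line[] = "PE|1|0|0|0|1|1";  // mutable copy of one row
    for (char* tok = ::strtok(line, "|"); tok != NULL; tok = ::strtok(NULL, "|"))
        ::printf("field: %s\n", tok);
}

Be aware that strtok() treats consecutive delimiters as one, so empty fields (||) are silently skipped.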

Lol4t0
Jason
  • *It also has the added benefit of low memory usage as I do not read the file contents into memory.* Of course it only performs well because you do use the memory, only you don't realize it. The operating system caches the file for you, and you waste extra time on every row to re-parse this stored-in-RAM file. It is generally not a very clever thing to read files in the GUI thread. If the system is busy, an uncached file read can block for hundreds of milliseconds. – Kuba hasn't forgotten Monica Sep 18 '13 at 18:23
  • Alas, the `QAbstractItemModel` does not implement a request-response interface: a call to `data()` is expected to return ASAP. I've found an oft acceptable workaround: return a dummy value in `data()`, at the same time queuing the work to be done in a worker thread. The worker does the blocking and slow reads from the file and eventually fires the model's `dataChanged()` signal. – Kuba hasn't forgotten Monica Sep 18 '13 at 18:25
  • When you're scrolling all around your table, shortly the OS will likely read the entire file for you anyway, so you are doing a `readAll()` with all of its costs, except that the memory is not in your application, but in the OS block cache. – Kuba hasn't forgotten Monica Sep 18 '13 at 18:27
  • Fair enough. In my application the speed improvement was remarkable - from a 10-20 second delay to unnoticeable when first displaying the contents. I also cache a parsed row of data so that sequential queries to the same row do not require another read/parse. With that improvement scrolling performance provides no noticeable delay. – Jason Sep 18 '13 at 19:54
  • This is an expected improvement. The next step is to move the file access to a separate thread. I hate when entire applications get frozen due to, say, a temporarily disconnected network cable that happened to carry a network share with currently open file. – Kuba hasn't forgotten Monica Sep 18 '13 at 20:44