I have a large tab-delimited flat-file table, and I want to load all of its data into a 2D vector container as quickly as possible. My code is below. I compiled it with -Ofast, -Os and -O2, but it still takes nearly 20 seconds to load 100,000 records with 4 columns. I want to load 500,000 records within 1 second. How can I achieve that?

#include <fstream>
#include <sstream>
#include <string>
#include <vector>

using namespace std;

typedef vector <string> record_t;
typedef vector <record_t> table_t;

// Read one line from the stream and split it on tabs into r_record.
fstream& operator >> ( fstream& ins, record_t& r_record )
  {
  r_record.clear();

  string s_line;
  getline( ins, s_line );

  stringstream ss( s_line );
  string s_field;
  while (getline( ss, s_field, '\t' ))
    {
    r_record.push_back( s_field );
    }

  return ins;
  }

// Read records until the stream fails, appending each one to t_data.
fstream& operator >> ( fstream& ins, table_t& t_data )
  {
  t_data.clear();

  record_t r_record;
  while (ins >> r_record)
    {
    t_data.push_back( r_record );
    }

  return ins;
  }

int main()
  {
  fstream fs("somesamplefile.txt",ios::in);
  table_t table;
  fs>>table;
  }

Load times by optimization level:

-Os      22.860000 seconds
-Ofast   21.320000 seconds
-O2      22.660000 seconds
Smith Dwayne
  • Do you know the size up front? If so you could call `reserve()` and avoid some reallocations. – jaket Dec 18 '14 at 06:05
  • Actually, I am not sure how many records the table contains. It may have over 600,000 records. How could I use reserve() for an unknown number of records? – Smith Dwayne Dec 18 '14 at 06:06
  • If you don't know there is really no point in calling reserve. Can you use a linked list? – jaket Dec 18 '14 at 06:11
  • Yes, I can. Is there a standard container for a linked list? – Smith Dwayne Dec 18 '14 at 06:13
  • What hardware, OS, etc. are you using? On my system (Linux, 64-bit, 3GHz), your existing code loaded 500K rows of 4 columns in 0.7 seconds. Total file size 9.8 MB (how about yours?). – John Zwinck Dec 18 '14 at 06:17
  • Ubuntu 12.04 32-bit, kernel Linux 3.2.0-29-generic-pae, 4 GB RAM, 2.80 GHz, 100K rows of 4 columns. File size 2.1 MB – Smith Dwayne Dec 18 '14 at 06:25
  • OK well, I tried on Ubuntu 10.04 64-bit, 100K rows of 4 columns, file size 2.0 MB, and it runs in 0.2 seconds *without optimizations enabled.* The performance discrepancy here is huge. Is your system a regular desktop or something special? Have you tried profiling? – John Zwinck Dec 18 '14 at 06:28
  • It is a regular desktop. What is profiling? I don't have any idea about it. – Smith Dwayne Dec 18 '14 at 06:33
  • http://stackoverflow.com/questions/375913/what-can-i-use-to-profile-c-code-in-linux - but I suggest you can also try writing a "C" version of the same code, using `fopen()` and `fgets()` instead of `fstream` and `getline()`. You can still use `vector` and `string` to store the data, but you may find the C++ iostream overhead is hurting you. – John Zwinck Dec 18 '14 at 06:36
  • Dunno what the issue is. MacBook Air i7 duo-core 4gB, OSX 10.9.5 clang 3.5, release-build (O2) I pull a 9.2MB 500,000 4 columns file with your existing code in ~514ms. Full-debug pulls 1340ms. Something on that system is heinous, with times 15x debug and 40x release. – WhozCraig Dec 18 '14 at 06:54
  • This is my compile command: g++ -Wall -Ofast -g -std=c++0x vectortab.cpp -o vecttab – Smith Dwayne Dec 18 '14 at 06:58
  • I use Valgrind to run my code: valgrind --leak-check=yes --show-reachable=yes --track-origins=yes ./vecttab > debug.txt 2>&1 – Smith Dwayne Dec 18 '14 at 07:05
  • @WhozCraig Is it a 32-bit or 64-bit processor? – Smith Dwayne Dec 18 '14 at 07:14
  • @SmithDwayne 64bit (mid 2011) – WhozCraig Dec 18 '14 at 07:16
  • Can anybody test this code on a 32-bit machine? – Smith Dwayne Dec 18 '14 at 07:43
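
The "C" version suggested in the comments above, using `fopen()` and `fgets()` while still storing into `vector` and `string`, might look roughly like the sketch below. The names `split_tabs` and `load_table_stdio` are hypothetical, and the 64 KB line buffer is an assumption: a longer line would be split across several records. Unlike the `getline`-based splitter in the question, this one keeps a trailing empty field and turns an empty line into a record with one empty field.

```cpp
#include <cstdio>
#include <cstring>
#include <string>
#include <vector>

typedef std::vector<std::string> record_t;
typedef std::vector<record_t> table_t;

// Hypothetical helper: split line[0..len) on tabs into out.
static void split_tabs(const char* line, std::size_t len, record_t& out)
{
    out.clear();
    const char* p = line;
    const char* end = line + len;
    while (true) {
        const char* tab =
            static_cast<const char*>(std::memchr(p, '\t', end - p));
        if (!tab) {                 // last field on the line
            out.emplace_back(p, end);
            break;
        }
        out.emplace_back(p, tab);
        p = tab + 1;
    }
}

// Hypothetical loader: returns false if the file cannot be opened.
bool load_table_stdio(const char* path, table_t& table)
{
    table.clear();
    std::FILE* fp = std::fopen(path, "r");
    if (!fp) return false;

    char buf[64 * 1024];            // assumption: no line is longer than this
    record_t record;
    while (std::fgets(buf, sizeof buf, fp)) {
        std::size_t len = std::strlen(buf);
        if (len > 0 && buf[len - 1] == '\n') --len;   // drop the newline
        split_tabs(buf, len, record);
        table.push_back(record);
    }
    std::fclose(fp);
    return true;
}
```

Calling `load_table_stdio("somesamplefile.txt", table)` would replace `fs>>table` in the question. Whether it is actually faster on a given system is something only a measurement can show.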

1 Answer

If your platform permits it - it probably does - try reading the entire file into a single memory buffer, then from the buffer into your vector.
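
As a rough sketch of that idea, the code below reads the whole file into one `std::string` and then splits it in memory. The function names `parse_buffer` and `load_table_buffered` are hypothetical, not from the question. Once the data is in memory the number of lines can be counted, so the `reserve()` suggestion from the comments becomes usable. The sketch assumes `\n` line endings and, unlike the original splitter, keeps trailing empty fields.

```cpp
#include <algorithm>
#include <fstream>
#include <string>
#include <vector>

typedef std::vector<std::string> record_t;
typedef std::vector<record_t> table_t;

// Hypothetical helper: split a tab-delimited buffer that is already in memory.
void parse_buffer(const std::string& buf, table_t& table)
{
    table.clear();
    // The line count is known now, so reserve() avoids repeated reallocation.
    table.reserve(std::count(buf.begin(), buf.end(), '\n') + 1);

    std::size_t pos = 0;
    while (pos < buf.size()) {
        std::size_t eol = buf.find('\n', pos);
        if (eol == std::string::npos) eol = buf.size();

        table.emplace_back();               // build the row in place
        record_t& rec = table.back();
        std::size_t start = pos;
        while (true) {
            std::size_t tab = buf.find('\t', start);
            if (tab == std::string::npos || tab > eol) tab = eol;
            rec.emplace_back(buf, start, tab - start);
            if (tab == eol) break;          // that was the last field
            start = tab + 1;
        }
        pos = eol + 1;
    }
}

// Hypothetical loader: one large read instead of one read per line.
bool load_table_buffered(const char* path, table_t& table)
{
    std::ifstream in(path, std::ios::in | std::ios::binary);
    if (!in) return false;

    in.seekg(0, std::ios::end);
    std::streamoff size = in.tellg();
    in.seekg(0, std::ios::beg);
    if (size < 0) return false;

    std::string buf(static_cast<std::size_t>(size), '\0');
    if (size > 0 && !in.read(&buf[0], size)) return false;

    parse_buffer(buf, table);
    return true;
}
```

Building each row directly inside the table with `emplace_back()` also avoids copying a temporary `record_t` for every line, which the question's `push_back( r_record )` does.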

If your platform has memory mapping - Linux, BSD, Mac OS X and Windows all do - mapping the file is usually faster and uses less memory than reading it with file I/O system calls.

Whether you use file I/O (like UNIX's read(2) system call) or mapping (mmap(2) on *NIX; I don't recall what the Windows equivalent is called), you'll avoid a great many system calls. I expect getline does some buffering itself, but its buffer won't be that big.
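
For the mapping route, a minimal POSIX-only sketch could look like the following. The function name `load_table_mmap` is hypothetical, error handling is kept to a minimum, and the code assumes `\n` line endings. It will not build on Windows as written.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cstring>
#include <string>
#include <vector>

typedef std::vector<std::string> record_t;
typedef std::vector<record_t> table_t;

// Hypothetical loader: map the file read-only and split it in place.
bool load_table_mmap(const char* path, table_t& table)
{
    table.clear();
    int fd = open(path, O_RDONLY);
    if (fd < 0) return false;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return false; }
    if (st.st_size == 0) { close(fd); return true; }  // mmap rejects length 0

    std::size_t size = static_cast<std::size_t>(st.st_size);
    void* map = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);          // the mapping stays valid after the descriptor closes
    if (map == MAP_FAILED) return false;

    const char* p = static_cast<const char*>(map);
    const char* end = p + size;
    while (p < end) {
        const char* eol =
            static_cast<const char*>(std::memchr(p, '\n', end - p));
        if (!eol) eol = end;                // last line has no newline

        table.emplace_back();
        record_t& rec = table.back();
        while (true) {
            const char* tab =
                static_cast<const char*>(std::memchr(p, '\t', eol - p));
            if (!tab) tab = eol;
            rec.emplace_back(p, tab);       // copy the field out of the mapping
            if (tab == eol) break;
            p = tab + 1;
        }
        p = (eol == end) ? end : eol + 1;
    }

    munmap(map, size);
    return true;
}
```

Each field is still copied into its own `std::string`, so the mapping only removes the read calls and the intermediate buffer. The per-field allocations remain.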

Mike Crawford