
I am implementing range-search code in C++ where I have a big file (20 GB) and need to search a specific range for different queries.
I have divided the big file into smaller chunks to speed up the search, using two levels, a root and leaves, where the data is stored in the leaves (following the same idea as an ISAM tree).

For example: I have 3,000,000,000 lines of data, divided into 30,000 pages with 100,000 lines each, and a root that points to each page (so the root has 30,000 entries).
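Roughly, the lookup works like the sketch below (simplified: the struct/function names, the per-page files, and the parsing step are just illustrative, not my actual code):

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Simplified sketch of the two-level layout: the root keeps, for every
// leaf page, the first key stored on that page and the file holding it.
struct RootEntry {
    long long firstKey;    // smallest key on the page
    std::string pagePath;  // file with that page's 100,000 lines
};

// Binary-search the root for the last page whose first key is <= key.
std::size_t findStartPage(const std::vector<RootEntry>& root, long long key) {
    std::size_t lo = 0, hi = root.size();
    while (lo < hi) {
        std::size_t mid = (lo + hi) / 2;
        if (root[mid].firstKey <= key) lo = mid + 1; else hi = mid;
    }
    return lo == 0 ? 0 : lo - 1;
}

// Scan pages sequentially until the end of the requested range, opening
// each page file, reading it line by line, and closing it afterwards.
void rangeSearch(const std::vector<RootEntry>& root, long long from, long long to) {
    for (std::size_t p = findStartPage(root, from); p < root.size(); ++p) {
        if (root[p].firstKey > to) break;      // past the end of the range
        std::ifstream page(root[p].pagePath);  // opened per page...
        std::string line;
        while (std::getline(page, line)) {
            // parse the key from 'line' and output it if from <= key <= to
        }
    }                                          // ...and closed when 'page' goes out of scope
}
```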

However, I noticed that once the search range starts at page 200 or higher, reading becomes significantly slower. I close each page after I am finished with it. Is there any reason why the reading stream becomes so slow?

  • I am running on a Linux machine.
  • I don't have the option of performing multi-threading.
  • The reads from these files are sequential.
Salma
  • You need to provide more details and also do some profiling. You should mainly check what is happening in the system when you start seeing the slowness. See the output of top, iostat, free, /proc/meminfo ... It looks more like a system-level issue than a C++ issue, unless you messed up your logic – Arunmu Nov 03 '16 at 06:27
  • If you start seeing lots of page swaps, then it will probably make sense to enable hugepages on your system (I am assuming it's Linux) – Arunmu Nov 03 '16 at 06:28
  • Also, I assume you are doing some sort of sequential reading; doing random reads would kill the performance. – Arunmu Nov 03 '16 at 06:30
  • Multi-threading would also make sense if you are spending considerable time on the CPU working on the data rather than waiting on the data from disk. – Arunmu Nov 03 '16 at 06:31
  • See..I can't stop commenting ..:) So have marked the question for close as "Too broad". Please provide more details. – Arunmu Nov 03 '16 at 06:32
  • Yes, it is Linux and I'm doing sequential reads. Multi-threading is not an option in my case. – Salma Nov 03 '16 at 06:34
  • I did try it on multiple machines and I'm facing the same issue. – Salma Nov 03 '16 at 06:37
  • Yeah, I never expected it to work fast on a different machine. Your best bet for making sense of what is going on in the system is to use the commands I mentioned before. If you have `sar` installed, that would be awesome. It's a great utility for periodically dumping all the needed information. – Arunmu Nov 03 '16 at 06:39
  • 2
    Fast file I/O critically depends on whether the data is going to be present in the file system cache. At 20 GB, that's a lost cause. Sooner or later the cache is going to run out and your program is going to have to wait for the disk to supply the data. It now proceeds at disk speeds, never more than ~60 megabytes/second for a consumer-level spindle disk. Buy more RAM, buy an SSD, never wait for the program to complete. Do note that the cache cannot help when the file wasn't accessed before, the more typical usage that doesn't compare well with the way you are testing it now. – Hans Passant Nov 03 '16 at 09:43
  • And you'll be lucky to get even 60 megabyte/sec out of a consumer-grade SATA drive under likely real-world conditions. The only way a consumer-grade SATA drive will be able to support IO speeds in that range is with large-block sequential IO operations that don't require much seeking. In real-world use, that means the disk pretty much has to be dedicated to being used by a single process. 5400 rpm SATA drives can typically handle at most 50 IO ops/sec, 7200 rpm SATA drives do about 70. If each IO op is small and needs a head seek, IO throughput can be 20-30 kilobyte/sec. Or worse. – Andrew Henle Nov 03 '16 at 10:14
  • To test how fast you can read the file, run something like `/usr/bin/time dd if=/path/to/input/file of=/dev/null bs=1M [iflag=direct]` That will tell you how fast it's possible to read the entire file. You can try with and without the `iflag=direct` option to see if direct IO improves performance. See http://stackoverflow.com/questions/33485108/why-is-dd-with-the-direct-o-direct-flag-so-dramatically-faster And the `bs=1M` blocksize is probably overkill, but a blocksize that large *will* effectively force entirely sequential IO and help minimize disk seeks. – Andrew Henle Nov 03 '16 at 10:22

1 Answer


why the reading stream becomes so slow?

Cache misses!

The speed of parsing the file depends heavily on the file system cache. If the requested blocks are already in the cache, parsing is relatively fast; if they are not, it is much slower.

You have a big file (20 GB), which is too big to fit in the cache. As a result, the cache will eventually be exhausted and the program will be forced to fetch data from the disk, which hurts performance heavily.
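One way to confirm this (a rough sketch, with an assumed page-file naming scheme that you would adapt to your layout) is to time each page read in isolation: while the pages are still served from the cache the times stay low, and once the cache runs out the per-page time jumps sharply.

```cpp
#include <chrono>
#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>

// Rough sketch: time how long it takes to read each page file end to end.
// A sharp jump in the per-page time is a strong hint that later pages are
// no longer served from the file system cache and are coming from disk.
int main() {
    for (int page = 0; page < 300; ++page) {
        // Assumed naming scheme for the leaf pages; adjust to the real one.
        std::string path = "pages/page_" + std::to_string(page) + ".txt";

        auto start = std::chrono::steady_clock::now();
        std::ifstream in(path);
        std::string line;
        std::size_t lines = 0;
        while (std::getline(in, line)) ++lines;
        auto elapsed = std::chrono::steady_clock::now() - start;

        std::cout << path << ": " << lines << " lines in "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count()
                  << " ms\n";
    }
}
```

Dropping the cache between runs (`sync; echo 3 > /proc/sys/vm/drop_caches` as root) shows the cold-disk numbers, which is closer to what a first-time query against the 20 GB file will see.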

gsamaras