11

I wrote an application that processes data on the GPU. The code works well, but reading the input file (~3 GB, text) is the bottleneck of my application. (Reading from the HDD itself is fast, but processing it line by line is slow.)

I read each line with getline(), copy line 1 to a vector, line 2 to another vector, and skip lines 3 and 4, and so on for the rest of the ~11 million lines.

I tried several approaches to read the file as fast as possible:

The fastest method I found is using boost::iostreams::stream.

Others were:

  • Reading the file as gzip to minimize I/O, but this is slower than reading it directly.
  • Copying the file into RAM with read(filepointer, chararray, length) and processing it with a loop to split it into lines (also slower than boost).

Any suggestions on how to make it run faster?

void readfastq(char *filename, int SRlength, uint32_t blocksize){
    _filelength = 0; //total datasets (each 4 lines)
    _SRlength = SRlength; //length of the 2nd line
    _blocksize = blocksize;

    boost::iostreams::stream<boost::iostreams::file_source> ins(filename);
    in = ins;

    readNextBlock();
}


void readNextBlock() {
    timeval start, end;
    gettimeofday(&start, 0);

    string name;
    string seqtemp;
    string garbage;
    string phredtemp;

    _seqs.empty();
    _phred.empty();
    _names.empty();
    _filelength = 0;

    //read only a part of the file, i.e. the first 4 million lines
    while (std::getline(in, name) && _filelength < _blocksize) {
        std::getline(in, seqtemp);
        std::getline(in, garbage);
        std::getline(in, phredtemp);

        if (seqtemp.size() != _SRlength) {
            if (seqtemp.size() != 0)
                printf("Error on read in fastq: size is invalid\n");
        } else {
            _names.push_back(name);

            for (int k = 0; k < _SRlength; k++) {
                //handle special letters
                if (seqtemp[k] == 'A') ...
                else {
                    _seqs.push_back(5);
                }
            }
            _filelength++;
        }
    }
}

EDIT:

The source-file is downloadable under https://docs.google.com/open?id=0B5bvyb427McSMjM2YWQwM2YtZGU2Mi00OGVmLThkODAtYzJhODIzYjNhYTY2

I changed the function readfastq to read the whole file at once, because of some pointer problems. So if you call readfastq, the blocksize (in lines) must be bigger than the number of lines to read.

SOLUTION:

I found a solution which gets the time for reading in the file down from 60 s to 16 s. I removed the inner loop which handles the special characters and do this on the GPU instead. This decreases the read-in time and only minimally increases the GPU running time.

Thanks for your suggestions.

void readfastq(char *filename, int SRlength) {
    _filelength = 0;
    _SRlength = SRlength;

    size_t bytes_read, bytes_expected;

    FILE *fp;
    fp = fopen(filename, "r");

    fseek(fp, 0L, SEEK_END); //go to the end of file
    bytes_expected = ftell(fp); //get filesize
    fseek(fp, 0L, SEEK_SET); //go to the beginning of the file

    fclose(fp);

    if ((_seqarray = (char *) malloc(bytes_expected/2)) == NULL) //allocate space for file
        err(EX_OSERR, "data malloc");


    string name;
    string seqtemp;
    string garbage;
    string phredtemp;

    boost::iostreams::stream<boost::iostreams::file_source> file(filename);


    while (std::getline(file, name)) {
        std::getline(file, seqtemp);
        std::getline(file, garbage);
        std::getline(file, phredtemp);

        if (seqtemp.size() != SRlength) {
            if (seqtemp.size() != 0)
                printf("Error on read in fastq: size is invalid\n");
        } else {
            _names.push_back(name);

            strncpy( &(_seqarray[SRlength*_filelength]), seqtemp.c_str(), seqtemp.length()); //do not handle special letters here, do on GPU

            _filelength++;
        }
    }
}
mic
  • By 'ram' do you mean the PC RAM, or the onboard video memory? – stijn Nov 14 '11 at 14:42
  • Note that [`string::empty()`](http://www.cplusplus.com/reference/string/string/empty/) and [`vector::empty()`](http://www.cplusplus.com/reference/stl/vector/empty/) are read-only tests on the state of the container. Perhaps you meant `.clear()`? – André Caron Nov 14 '11 at 15:10
  • possible duplicate of [What is the Fastest Method for High Performance Sequential File I/O in C++?](http://stackoverflow.com/questions/1201261/what-is-the-fastest-method-for-high-performance-sequential-file-i-o-in-c) – Ben Voigt Nov 14 '11 at 15:10
  • `_seqs.empty();` just returns `true`. The default constructor creates an empty string, and `bool std::string::empty() const` is not the same as `void std::string::clear()`. – MSalters Nov 14 '11 at 15:11
  • I don't know the problem you are solving, but are you sure a text file is the way to go with such a large data set and an apparently time-critical application? – Gabriel Schreiber Nov 14 '11 at 15:13
  • Post compilable code so we can test it and see where the bottleneck is. – Martin York Nov 14 '11 at 16:25
  • You are right, I confused clear and empty, but it changes nothing significant – mic Nov 14 '11 at 18:41
  • @BenVoigt my issue is related to reading line by line; I also read a binary file (int values) in my application, which works really fast with read(fd, dataarray, bytes_expected); – mic Nov 14 '11 at 19:34

5 Answers

7

First, instead of reading the file into memory you may work with file mappings. You just have to build your program as 64-bit so that 3 GB fits into the virtual address space (for a 32-bit application only 2 GB is accessible in user mode). Alternatively, you may map and process your file in parts.
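For reference, a minimal sketch of creating such a mapping on a POSIX/Linux system (the helper name map_file is only illustrative, and error handling is kept to a bare minimum); the line parser further down in this answer can then work directly on the returned pointer/length pair:

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

// Map the whole file read-only and return a pointer/length pair.
// The caller is responsible for munmap(ptr, len) when done.
const char* map_file(const char* path, size_t& len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return 0;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return 0; }
    len = st.st_size;

    void* p = mmap(0, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); // the mapping stays valid after closing the descriptor
    return (p == MAP_FAILED) ? 0 : static_cast<const char*>(p);
}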

Next, it sounds to me that your bottleneck is "copying a line to a vector". Dealing with vectors involves dynamic memory allocation (heap operations), which in a critical loop hurts performance very seriously. If this is the case, either avoid using vectors, or make sure they're declared outside the loop. The latter helps because when you reallocate/clear vectors they do not free memory.

Post your code (or a part of it) for more suggestions.

EDIT:

It seems that all your bottlenecks are related to string management.

  • std::getline(in, seqtemp); reading into an std::string involves dynamic memory allocation.
  • _names.push_back(name); This is even worse. First, the std::string is placed into the vector by value, meaning the string is copied, so another dynamic allocation/deallocation happens. Moreover, when the vector is eventually reallocated internally, all the contained strings are copied again, with all the consequences.

I recommend using neither standard formatted file I/O functions (Stdio/STL) nor std::string. To achieve better performance you should work with pointers to strings (rather than copied strings), which is possible if you map the entire file. Plus you'll have to implement the file parsing (division into lines).

Like in this code:

#include <cstring> // memchr

class MemoryMappedFileParser
{
    const char* m_sz;
    size_t m_Len;

public:

    MemoryMappedFileParser(const char* sz, size_t len)
        : m_sz(sz), m_Len(len) {}

    struct String {
        const char* m_sz;
        size_t m_Len;
    };

    bool getline(String& out)
    {
        out.m_sz = m_sz;

        const char* sz = (const char*) memchr(m_sz, '\n', m_Len);
        if (sz)
        {
            size_t len = sz - m_sz;

            m_sz = sz + 1;
            m_Len -= (len + 1);

            out.m_Len = len;

            // for Windows-format text files remove the '\r' as well
            if (len && '\r' == out.m_sz[len-1])
                out.m_Len--;
        }
        else
        {
            out.m_Len = m_Len;

            if (!m_Len)
                return false;

            m_Len = 0;
        }

        return true;
    }
};
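A hypothetical usage example, tying this together with the map_file() sketch above (the file name is only a placeholder):

size_t len = 0;
const char* data = map_file("input.fastq", len); // from the mapping sketch above
if (data)
{
    MemoryMappedFileParser parser(data, len);
    MemoryMappedFileParser::String line;

    while (parser.getline(line))
    {
        // line.m_sz / line.m_Len point straight into the mapping; nothing is copied
    }

    munmap((void*) data, len);
}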
valdo
  • I disagree. I think it can be written well using the standard libraries (but the OP has not posted enough code for context). By doing the above you are just making the code very brittle. – Martin York Nov 14 '11 at 16:25
  • I don't know what you mean by "brittle". The code is either correct or not, IMHO. But nevertheless I don't insist. If you have "standard" library functions that do the same - you're welcome. – valdo Nov 14 '11 at 16:27
  • Brittle means easy to use incorrectly in a way that will make the code incorrect. Of course if used as intended it will work. But brittle code is horrible to maintain and use past the original author (as only they know how to use it correctly). – Martin York Nov 14 '11 at 16:36
  • Alright, you don't like the "brittle" code - post your suggestion, written in terms of "standard" libraries. – valdo Nov 14 '11 at 16:45
5

If _seqs and _names are std::vectors and you can guess their final size before processing the whole 3 GB of data, you can use reserve to avoid most of the memory reallocation when pushing back new elements in the loop (a sketch follows below).

You should be aware of the fact that the vectors effectively produce another copy of parts of the file in main memory. So unless you have a main memory sufficiently large to store the text file plus the vector and its contents, you will probably end up with a number of page faults that also have a significant influence on the speed of your program.
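A minimal sketch of what that reserve could look like in the question's code (the record estimate is only derived from the ~11 million lines mentioned there):

// ~11 million lines at 4 lines per record gives roughly 2.75 million records;
// this is only an estimate taken from the question.
const size_t expected_records = 11000000 / 4;

_names.reserve(expected_records);             // one name per record
_seqs.reserve(expected_records * _SRlength);  // one value per base of every read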

moooeeeep
  • I reserved space for the vector and string before the loop; it saved 2 out of 60 seconds. – mic Nov 14 '11 at 18:42
2

You are apparently using <stdio.h>, since you are using getline.

Perhaps fopen-ing the file with fopen(path, "rm"); might help, because the m (a GNU extension) tells it to use mmap for reading.

Perhaps setting a big buffer (e.g. half a megabyte) with setbuffer could also help.

Probably, using the readahead system call (in a separate thread perhaps) could help.

But all of this is guesswork. You should really measure things.
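A rough sketch of the first two suggestions combined; note that both the "m" mode flag and setbuffer are GNU/BSD extensions, so this is glibc-specific and only a guess at what might help:

#include <cstdio>
#include <vector>

void read_lines(const char *filename)
{
    // "m" asks glibc to use mmap for reading (GNU extension, ignored elsewhere)
    FILE *fp = std::fopen(filename, "rm");
    if (!fp) return;

    std::vector<char> iobuf(512 * 1024);        // half-megabyte stdio buffer
    setbuffer(fp, &iobuf[0], iobuf.size());     // must be set before the first read

    char line[4096];
    while (std::fgets(line, sizeof line, fp)) {
        // process one line here
    }
    std::fclose(fp);
}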

Basile Starynkevitch
2

General suggestions:

  • Code the simplest, most straight-forward, clean approach,
  • Measure,
  • Measure,
  • Measure,

Then if all else fails:

  • Read raw bytes (read(2)) in page-aligned chunks. Do so sequentially, so the kernel's read-ahead works to your advantage (see the sketch after this list).
  • Re-use the same buffer to minimize cache flushing.
  • Avoid copying data, parse in place, pass around pointers (and sizes).

mmap(2)-ing [parts of] the file is another approach. This also avoids the kernel-to-userland copy.
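A minimal sketch of the read(2) approach under those assumptions (POSIX only; the in-place parsing and the handling of a line that straddles two chunks are left out):

#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>

void scan_file(const char *filename)
{
    int fd = open(filename, O_RDONLY);
    if (fd < 0) return;

    const size_t chunk = 1 << 20;   // 1 MiB, a multiple of the page size
    void *buf = 0;
    if (posix_memalign(&buf, 4096, chunk) != 0) { close(fd); return; }

    ssize_t n;
    while ((n = read(fd, buf, chunk)) > 0) {
        // parse the n bytes in place here, passing around pointers and sizes;
        // keep the unfinished tail of the last line for the next iteration
    }

    free(buf);
    close(fd);
}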

Nikolai Fetissov
  • The stream buffer is already doing this. No point to do it manually. – Martin York Nov 14 '11 at 16:26
  • Yes. I find no real advantage to doing it manually. Any local performance increases are out-weighed by maintenance costs and improvements in algorithm on the larger scale. – Martin York Nov 14 '11 at 18:30
  • Sure. Show me a spec of what needs to be done. Writing something based on only half the information (like the above) is not going to be much use, as I would end up optimizing for the local situation without knowing the greater context. Premature optimization like that is just a waste of time. :-) Looking at the link above, this is actually all about manipulating DNA sequences; we should be able to use that information to make it more efficient. But without the full spec, how can I tell? – Martin York Nov 14 '11 at 20:42
  • Do you expect to be paid too? :) – Nikolai Fetissov Nov 14 '11 at 20:47
  • What I am trying to say is: your optimization of using a structure with a pointer and a size (like @valdo's above) is premature. If all the code is doing is printing the result, then fine (it may speed up the application). But if the strings need to be manipulated, you are going to need a copy. Thus all you are doing is **moving** the cost from the load function into the manipulation function (overall there will be no performance improvement in the application). In addition, you are making the code more brittle by hand-rolling your own technique for handling strings. – Martin York Nov 14 '11 at 21:06
  • But it is hard to tell what exactly is the best method to move forward without knowing more about the application. Maybe the input/output serialization format can be improved; maybe your suggestion of page-aligned chunks is OK, but without more information we are optimizing far too locally, without considering the cost in other parts of the application. – Martin York Nov 14 '11 at 21:07
  • That's why I said "general suggestions". I admit I forgot "measure" as the first bullet point. Will add. – Nikolai Fetissov Nov 14 '11 at 21:21
0

Depending on your disk speed, using a very fast decompression algorithm might help, like fastlz (there are at least two others that might be more efficient, but they are under the GPL, so the licence can be a problem).

Also, using C++ data structures and functions can increase speed, as you can perhaps achieve better compile-time optimization. Going the C way isn't always the fastest! In some bad cases, using char* you need to scan the whole string to reach the \0, yielding disastrous performance.

For parsing your data, boost::spirit::qi is probably also the most highly optimized approach: http://alexott.blogspot.com/2010/01/boostspirit2-vs-atoi.html
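As an illustration, parsing an integer out of an already-loaded character range with qi::parse might look roughly like this (a sketch only; the linked post benchmarks this against atoi):

#include <boost/spirit/include/qi.hpp>

namespace qi = boost::spirit::qi;

// Parse one integer from a raw character range without building intermediate strings.
bool parse_int(const char *first, const char *last, int &value)
{
    return qi::parse(first, last, qi::int_, value) && first == last;
}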

Tristram Gräbener
  • The bottleneck is not the I/O of the HDD itself (high-performance RAID system). The approach with a compressed file was tried but was slower than reading uncompressed data (see initial post). – mic Nov 15 '11 at 07:58
  • GZip is oriented towards compression ratio, not compression speed. QuickLZ and FastLZ are oriented towards decompression speed. Look at their benchmarks: http://www.quicklz.com/. But of course if you already read at 300 MB/sec, it might not help. – Tristram Gräbener Nov 16 '11 at 13:37