
I'm trying to split a massive QByteArray containing UTF-8 encoded plain text (using whitespace as the delimiter) with the best performance possible. I found that I can achieve much better results if I convert the array to a QString first. I tried QString::split with a regexp, but the performance was horrendous. This code turned out to be far faster:

QMutex mutex;
QSet<QString> split(QByteArray body)
{
    QSet<QString>  slova;

    QString s_body = QTextCodec::codecForMib(106)->toUnicode(body);
    QString current;

    for (int i = 0; i < s_body.size(); i++) {
        if (s_body[i] == '\r' || s_body[i] == '\n' || s_body[i] == '\t' || s_body[i] == ' ') {

            mutex.lock();
            slova.insert(current);

            mutex.unlock();
            current.clear();
            current.reserve(40);
        } else {
            current.push_back(s_body[i]);
        }
    }
    return slova;
}

"Slova" is a QSet<QString> currently, but I could use a std::set or any other format. This code is supposed to find how many unique words there are in the array, with the best performance possible.

Unfortunately, this code runs far from fast enough. I'm looking to squeeze the absolute maximum out of this.

Using callgrind, I found that the most gluttonous internal functions were:

QString::reallocData (18% absolute cost)
QString::append (10% absolute cost)
QString::operator= (8% absolute cost)
QTextCodec::toUnicode (8% absolute cost)

Obviously, this has to do with the memory allocation stemming from the push_back function. What is the best way to solve this? It doesn't necessarily have to be a Qt solution - pure C or C++ is also acceptable.

Toby Speight
user129186
  • Where is QByteArray in the source code? – Alexander V Apr 28 '16 at 21:26
  • I apologize, "body" is the array. Will add the previous line now. – user129186 Apr 28 '16 at 21:28
  • What kind of data is there in QByteArray? Is that necessary to put it in that type first? And the output should be what type? I mean 'slova' 'words'? – Alexander V Apr 28 '16 at 21:40
  • The QByteArray contains plain text encoded in UTF-8. And yes, it is necessary to put it in that type - the http server library I use returns bodies of http requests in that type. The output type does not necessarily matter. "Slova" is a QSet currently, but I could use a std::set or any other format. This code is supposed to find how many unique words there are in the array, with the best performance possible. – user129186 Apr 28 '16 at 21:44
  • A possible solution would be a 'bare' buffer of UTF-8 characters and an efficient tokenizer, e.g. the Boost tokenizer. You'd also be better off not locking and unlocking the mutex all the time; lock just once before the update and unlock after. I could also easily put together a Qt solution, but it would probably not be as fast. Also, I'm unsure what the 'current' variable does in your code. – Alexander V Apr 28 '16 at 21:55
  • The "current" variable is used to store the individual words between whitespaces - a buffer, basically. – user129186 Apr 28 '16 at 22:00
  • I guess you need to update your question with all that info so that people see the task you are trying to accomplish. I'll defer a Qt solution for a while; the better one requires a bit more coding. – Alexander V Apr 28 '16 at 22:04
  • As an aside, why use QTextCodec here? QString has a built in [fromUtf8 method](http://doc.qt.io/qt-5/qstring.html#fromUtf8-1) that takes a byte array. – MrEricSir Apr 28 '16 at 23:57
  • try [QByteArray::split(char sep)](http://doc.qt.io/qt-4.8/qbytearray.html#split) directly. Not with regexps – Andrei R. Apr 29 '16 at 03:40
  • Related: [How can I partition a QByteArray efficiently?](https://stackoverflow.com/questions/5978124/how-can-i-partition-a-qbytearray-efficiently) – iammilind Oct 21 '19 at 14:21

3 Answers


Minimise the amount of copying you need to do. Keep the input buffer in UTF-8, and don't store std::string or QString in your set; instead, create a small class to reference the existing UTF-8 data:

#include <QString>

class stringref {
    const char *start;
    size_t length;

public:
    stringref(const char *start, const char *end);
    operator QString() const;
    bool operator<(const stringref& other) const;
};

This can encapsulate a substring of the UTF-8 input. You'll need to ensure that it doesn't outlive the input string; you could do this by clever use of std::shared_ptr, but if the code is reasonably self-contained, then it should be tractable enough to reason about the lifetime.

We can construct it from a pair of pointers into our UTF-8 data, and convert it to QString when we want to actually use it:

stringref::stringref(const char *start, const char *end)
    : start(start), length(end-start)
{}

stringref::operator QString() const
{
    return QString::fromUtf8(start, length);
}

You need to define operator< so you can use it in a std::set.

#include <cstring>
bool stringref::operator<(const stringref& other) const
{
    return length == other.length
        ? std::strncmp(start, other.start, length) < 0
        : length < other.length;
}

Note that we sort by length before dereferencing pointers, to reduce cache impact.


Now we can write the split method:

#include <set>
#include <QByteArray>
std::set<stringref> split(const QByteArray& a)
{
    std::set<stringref> words;

    // start and end
    const auto s = a.data(), e = s + a.length();

    // current word
    auto w = s;

    for (auto p = s;  p <= e;  ++p) {
        switch (*p) {
        default: break;
        case ' ': case '\r': case '\n': case '\t': case '\0':
            if (w != p)
                words.insert({w, p});
            w = p+1;
        }
    }

    return words;
}

The algorithm is pretty much yours, with the addition of the w!=p test so that runs of whitespace don't get counted.


Let's test it, and time the important bit:

#include <QDebug>
#include <chrono>
int main()
{
    QByteArray body{"foo bar baz\n  foo again\nbar again "};
    // make it a million times longer
    for (int i = 0;  i < 20;  ++i)
        body.append(body);

    using namespace std::chrono;
    const auto start = high_resolution_clock::now();

    auto words = split(body);

    const auto end = high_resolution_clock::now();
    qDebug() << "Split"
             << body.length()
             << "bytes in"
             << duration_cast<duration<double>>(end - start).count()
             << "seconds";

    for (auto&& word: words)
        qDebug() << word;
}

I get:

Split 35651584 bytes in 1.99142 seconds
"bar"
"baz"
"foo"
"again"

Compiling with -O3 reduced that time to 0.6188 seconds, so don't forget to beg the compiler for help!

If that's still not fast enough, it's probably time to start parallelising the task. You'll want to split the string into roughly equal lengths, but advance each boundary to the next whitespace so that no word straddles two threads' worth of work. Each thread should create its own set of results, and the reduction step is then to merge the result sets. I won't provide a full solution here, as that's another question in its own right.

Toby Speight
  • I've now tried splitting the Moby Dick text: "Split 1257296 bytes in 0.101347 seconds" giving 33780 distinct words. Just in case you thought my example with only 4 distinct words is unrepresentative! And on my i7-3770 machine, I get about twice the throughput, compared to the Q6600 I used when answering. – Toby Speight Apr 29 '16 at 10:34

The first thing I'd do if I were you is modify your code so it isn't locking and unlocking a QMutex for every word it inserts into the QSet - that's pure overhead. Either lock the QMutex only once, before the loop begins, and unlock it again after the loop terminates; or, better yet, insert into a QSet that isn't accessible from any other thread, so that you don't need to lock any QMutex at all.

With that out of the way, the second thing to do is eliminate as many heap allocations as possible. Ideally you'd execute the entire parse without ever allocating or freeing any dynamic memory at all; my implementation below does that (well, almost -- the unordered_set might do some internal allocations, but it probably won't). On my computer (a 2.7GHz Mac Mini) I measure a processing speed of around 11 million words per second, using the Gutenberg ASCII text of Moby Dick as my test input.

Note that due to the backward-compatible encoding that UTF-8 uses, this program will work equally well with either UTF-8 or ASCII input.

#include <ctype.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unordered_set>

// Loads in a text file from disk into an in-memory array
// Expected contents of the file are ASCII or UTF8 (doesn't matter which).
// Note that this function appends a space to the end of the returned array
// That way the parsing function doesn't have to include a special case
// since it is guaranteed that every word in the array ends with whitespace
static char * LoadFile(const char * fileName, unsigned long * retArraySizeBytes)
{
   char * ret = NULL;
   *retArraySizeBytes = 0;

   FILE * fpIn = fopen(fileName, "r");
   if (fpIn)
   {
      if (fseek(fpIn, 0L, SEEK_END) == 0)
      {
         const unsigned long fileSizeBytes  = ftell(fpIn);
         const unsigned long arraySizeBytes = *retArraySizeBytes = fileSizeBytes+1;  // +1 because I'm going to append a space to the end
         rewind(fpIn);

         ret = new char[arraySizeBytes];
         if (fread(ret, 1, fileSizeBytes, fpIn) == fileSizeBytes)
         {
            ret[fileSizeBytes] = ' ';  // appending a space allows me to simplify the parsing step
         }
         else
         {
            perror("fread");
            delete [] ret;
            ret = NULL;
         }
      }
      else perror("fseek");

      fclose(fpIn);
   }
   return ret;
}

// Gotta provide our own equality-testing function, otherwise unordered_set will just compare pointer values
// (note: no need to inherit std::binary_function; it was deprecated in C++11 and removed in C++17)
struct CharPointersEqualityFunction
{
    bool operator() (char * s1, char * s2) const {return strcmp(s1, s2) == 0;}
};

// Gotta provide our own hashing function otherwise unordered_set will just hash the pointer values
struct CharPointerHashFunction
{
   int operator() (char * str) const
   {
      // djb2 by Dan Bernstein -- fast enough and simple enough
      unsigned long hash = 5381;
      int c; while((c = *str++) != 0) hash = ((hash << 5) + hash) + c;
      return (int) hash;
   }
};

typedef std::unordered_set<char *, CharPointerHashFunction, CharPointersEqualityFunction > CharPointerUnorderedSet;

int main(int argc, char ** argv)
{
   if (argc < 2)
   {
      printf("Usage:  ./split_words filename\n");
      return 10;
   }    

   unsigned long arraySizeBytes;
   char * buf = LoadFile(argv[1], &arraySizeBytes);
   if (buf == NULL)
   {
      printf("Unable to load input file [%s]\n", argv[1]);
      return 10;
   }

   CharPointerUnorderedSet set;
   set.reserve(100000);  // trying to size (set) big enough that no reallocations will be necessary during the parse

   struct timeval startTime;
   gettimeofday(&startTime, NULL);

   // The actual parsing of the text is done here
   int wordCount = 0;
   char * wordStart = buf;
   char * wordEnd   = buf;
   char * bufEnd    = &buf[arraySizeBytes];
   while(wordEnd < bufEnd)
   {
      if (isspace((unsigned char)*wordEnd))  // cast avoids undefined behavior for UTF-8 bytes > 127
      {
         if (wordEnd > wordStart)
         {
            *wordEnd = '\0';
            set.insert(wordStart);
            wordCount++;
         }
         wordStart = wordEnd+1;   
      }
      wordEnd++;
   }

   struct timeval endTime;
   gettimeofday(&endTime, NULL);

   unsigned long long startTimeMicros = (((unsigned long long)startTime.tv_sec)*1000000) + startTime.tv_usec;
   unsigned long long endTimeMicros   = (((unsigned long long)  endTime.tv_sec)*1000000) + endTime.tv_usec;
   double secondsElapsed = ((double)(endTimeMicros-startTimeMicros))/1000000.0;

   printf("Parsed %i words (%zu unique words) in %f seconds, aka %.0f words/second\n", wordCount, set.size(), secondsElapsed, wordCount/secondsElapsed);
   //for (const auto& elem: set) printf("word=[%s]\n", elem);

   delete [] buf;
   return 0;
}
Jeremy Friesner

Your largest cost, as suspected, is in push_back causing frequent reallocations as you append one character at a time. Why not search ahead, then copy the whole word at once using QString::mid():

slova.insert(s_body.mid(beginPos, i - beginPos));

Where beginPos holds the index of the start of the current substring. Instead of appending each character to current before it is inserted into slova, the copy happens all at once. After copying a substring, search ahead for the next valid (not a separator) character and set beginPos equal to that index.

In (rough) code:

QString s_body = ...
//beginPos tells us the index of the start of the current substring.
//-1 means the previous character was a separator
int beginPos = -1;
for (...) {
    //basically your if statement from the question, as a function
    if (isSeparator(s_body[i])) {
        //ignore runs of whitespace, etc.
        if (beginPos != -1) {
            mutex.lock();
            slova.insert(s_body.mid(beginPos, i - beginPos));
            mutex.unlock();
            beginPos = -1;
        }
    } else if (beginPos == -1) {
        //if beginPos is not valid and we are not on a separator, we
        //are at the start of a new substring.
        beginPos = i;
    }
}

This approach will drastically reduce your overhead in heap allocations and eliminate QString::push_back() calls.

One final note: QByteArray also provides a mid() function. You can skip the conversion to QString entirely and work directly with the byte array.

jonspaceharper
  • 4,207
  • 2
  • 22
  • 42
  • It's a small change and most compilers optimize it away, but `++i` is slightly faster than `i++`, as it does not create a temporary. – jonspaceharper Apr 29 '16 at 11:46