
I'm very confused by a strange issue I'm having.

Consider the following code:

    testTimer.Start();

    u64 startPositionInBytes = startPosition * sizeof(DBTransaction);
    u64 chunkSizeInBytes = chunkSize * sizeof(DBTransaction);

    u64 viewOffset = (startPositionInBytes / allocGranularity) * allocGranularity;

    u64 mapViewSize = (startPositionInBytes % allocGranularity) + chunkSizeInBytes;
    if (mapViewSize > (queryThread->FileSize - viewOffset))
        mapViewSize = (queryThread->FileSize - viewOffset);

    u64 dataReadOffset = startPositionInBytes - viewOffset;

    DWORD high = (DWORD)(viewOffset >> 32);
    DWORD low = (DWORD)(viewOffset & 0xFFFFFFFF);
    u8* fileData = (u8*)MapViewOfFile(queryThread->FileMap, FILE_MAP_READ, high, low, mapViewSize);
    if (fileData == nullptr)
    {
        RALogError("Unable to create a file view into mapped file at position %llu (chunk size %llu). Error was: %s. Application can not proceed", startPosition, chunkSize, GetLastSystemError().c_str());
    }
    elapsed = testTimer.Measure();
    queryThread->MapGlobalTime += elapsed;

    DBTransaction* transaction = (DBTransaction*)(fileData + dataReadOffset);

    for (u32 i = 0; i < chunkSize; ++i)
    {
        testTimer.Start();
        //DBTransaction* transaction = (DBTransaction*)pBytes;

        queryThread->TotalRead++;

        elapsed = testTimer.Measure();
        queryThread->CopyTime += elapsed;

        testTimer.Start();
        
        bool allConditionsPassed = true;

        for (u32 j = 0; j < QueryFilters.size(); ++j)
        {
            RAQueryFilter* filter = &QueryFilters[j];
            bool atLeastOneTrue = false;
            for (u32 k = 0; k < filter->Conditions.size(); ++k)
            {
                QueryCondition* condition = &filter->Conditions[k];

                if (condition->Compare(transaction, condition))
                    atLeastOneTrue = true;

                if (atLeastOneTrue)
                    break;
            }

            if (atLeastOneTrue == false)
            {
                allConditionsPassed = false;
                break;
            }
        }
        elapsed = testTimer.Measure();
        queryThread->MatchGlobalTime += elapsed;
        
        testTimer.Start();
        if (allConditionsPassed)
        {
            queryPointer->AddToResult(queryThread, queryPointer, transaction);
            queryThread->TotalMatches++;
        }
        elapsed = testTimer.Measure();
        queryThread->ComputeGlobalTime += elapsed;

        transaction++;
    }

This runs in a thread that reads a chunk of a memory-mapped file. There are 10 of them, each reading a different chunk of the same file. If the file is small (2 GB), this code can read the whole thing and find all matches in almost no time.

When the size of the file increases, the time required increases as well (which is expected), but it grows far faster than the linear scaling I would expect. Working with a 12 GB file (20+ million transactions), the whole process takes 26 seconds, which is insane.

I decided to profile the code in the thread, and here is my confusion: MapViewOfFile takes a handful of milliseconds, so it's not the bottleneck. The time is almost entirely spent in the Compare function, but if I comment that call out (so every transaction matches), the total time is exactly the same and is spent in AddToResult instead.

If I also comment that out and instead copy the transaction from the buffer into a local variable (plus a counter used elsewhere, so the compiler cannot optimize the whole loop away), the time is yet again the same, but now it is spent in the copy.

My guess is that what actually takes the time is the first dereference of the DBTransaction pointer, wherever that happens to be.

I'm not sure what is going on. Can anyone explain, and suggest how to fix it?
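To check this, I could time the first touch of every page in the view separately from the matching loop. Here is a minimal sketch (not production code), reusing `testTimer`, `fileData`, `mapViewSize` and `elapsed` from the snippet above and assuming 4 KiB pages:

    testTimer.Start();
    volatile u8 touched = 0;
    const u64 kPageSize = 4096;         // assumption: 4 KiB pages
    for (u64 off = 0; off < mapViewSize; off += kPageSize)
        touched += fileData[off];       // first touch of each page forces it to be paged in
    elapsed = testTimer.Measure();      // compare this against CopyTime/MatchGlobalTime

If this first-touch pass accounts for nearly all of the time, the cost is in demand paging rather than in Compare or AddToResult.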

Here is the code that spawns the threads:

    u32 requiredThreads = AvailableThreadsCount;
    u64 transactionsPerThread = (u64)ceil((f32)dataToRead / (f32)requiredThreads);

    // Make sure to not make thread load too small. If the number of transactions is less than 500K, reduce the number of used threads
    if (transactionsPerThread < RA_MINIMUM_THREAD_TRANSACTION_COUNT)
    {
        requiredThreads = (u32)ceil((f32)dataToRead / (f32)RA_MINIMUM_THREAD_TRANSACTION_COUNT);
        transactionsPerThread = RA_MINIMUM_THREAD_TRANSACTION_COUNT;
    }



    // Spawn the threads
    u32 i = 0;
    while (dataToRead > 0)
    {
        u64 chunkSize = dataToRead;
        if (chunkSize > transactionsPerThread)
            chunkSize = transactionsPerThread;

        RA_ASSERT(i < requiredThreads, "Too many threads created!");
        sQueryThread* queryThread = &QueryThread[i];

        #ifdef USE_FILE_MAPPING
        queryThread->Set(startPosition, chunkSize, testReadChunk, TransactionsHistory->FileSizeInBytes, TransactionsHistory->ReadFileMapHandle);
        #else 
        queryThread->Set(startPosition, chunkSize, 0, NULL);
        #endif

        //RALog("Spawning thread %i", i);
        queryThread->Handle = CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)DoLinearSearchQueryThread, queryThread, 0, &queryThread->ThreadID);

        i++;
        startPosition += chunkSize;
        dataToRead -= chunkSize;
    }

I basically split the file into chunks and assign one chunk to each thread.

  • I don't see any thread synchronization in the code you posted. Could this possibly be [UB](https://en.cppreference.com/w/cpp/language/ub) due to a data race? – Jesper Juhl Jun 13 '23 at 02:46
  • @JesperJuhl I edited the question, adding the code that spawns the threads. The only function that writes something is AddToResult; it just sums a bunch of data from the transactions, and I use Interlocked operations everywhere, so I do not think it is a race condition issue – Pierluigi Serra Jun 13 '23 at 03:01
  • `ceil((f32)dataToRead / (f32)requiredThreads)` this is a bad way to round up. Just use `(dataToRead + requiredThreads - 1)/requiredThreads`, no need to convert to/from floating. See [Rounding integer division (instead of truncating)](https://stackoverflow.com/a/2422722/995714) – phuclv Jun 13 '23 at 03:34
  • It is not all that surprising to me; what you are running into is that you run out of physical memory. MapViewOfFile uses virtual memory management, so it starts thrashing. With enough threads accessing different parts of the file, the OS starts swapping pages in and out of memory very quickly. So you really need to make sure that when using multiple threads, the data they work on can stay in physical memory: do not use too many threads, and only allow your code to work on a "window" (or buffer) of the file that fits in physical memory. – Pepijn Kramer Jun 13 '23 at 03:40
  • Notes: [thrashing](https://en.wikipedia.org/wiki/Thrashing_(computer_science)). And use std::thread for your threads (don't use the Windows API). Then you need to profile to find out what the optimal number of threads and buffer/window size are, because you cannot blindly throw N threads at a problem and expect it to run N times as fast, due to things like data/cache coherency, predictability of data access in memory (so the CPU can prefetch efficiently), etc. – Pepijn Kramer Jun 13 '23 at 03:41
  • You can verify the application is thrashing by opening perfmon and checking the number of page faults per second (a programmatic check is sketched after these comments). – Pepijn Kramer Jun 13 '23 at 03:47
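For reference, the counter perfmon shows can also be read in-process with `GetProcessMemoryInfo` from psapi: sample it before and after a query run and compare. Note that `PageFaultCount` includes soft faults, but a steep rise while the query runs is still a useful signal. A minimal sketch:

    #include <windows.h>
    #include <psapi.h>   // link against psapi.lib
    #include <cstdio>

    // Print the cumulative page-fault count for the current process.
    void PrintPageFaults()
    {
        PROCESS_MEMORY_COUNTERS pmc = {};
        if (GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc)))
            std::printf("Page faults so far: %lu\n", pmc.PageFaultCount);
    }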

1 Answer


MapViewOfFile uses virtual memory management, so the process starts thrashing: with enough threads accessing different parts of the file, the OS starts swapping pages in and out of physical memory very quickly. So when using multiple threads, you really need to make sure the data they work on can stay in physical memory, to keep thrashing from happening.
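One way to keep the working set bounded is to have each thread walk its chunk in fixed-size windows, mapping and unmapping as it goes, so only a limited amount of the file is resident per thread at any moment. A rough sketch of the idea (the window size, `ScanChunk` and `processWindow` are placeholders, not the asker's API):

    #include <windows.h>
    #include <cstdint>

    static const uint64_t kWindowBytes = 64ull * 1024 * 1024; // e.g. 64 MiB per view

    // Walk [chunkStart, chunkStart + chunkBytes) of the mapping in bounded windows.
    void ScanChunk(HANDLE fileMap, uint64_t chunkStart, uint64_t chunkBytes,
                   uint64_t allocGranularity,
                   void (*processWindow)(const uint8_t* data, uint64_t bytes))
    {
        uint64_t pos = chunkStart;
        const uint64_t end = chunkStart + chunkBytes;
        while (pos < end)
        {
            // Align the view offset down to the allocation granularity.
            uint64_t viewOffset = (pos / allocGranularity) * allocGranularity;
            uint64_t skip       = pos - viewOffset;
            uint64_t want       = end - pos;
            if (want > kWindowBytes)
                want = kWindowBytes;

            uint8_t* view = (uint8_t*)MapViewOfFile(fileMap, FILE_MAP_READ,
                                                    (DWORD)(viewOffset >> 32),
                                                    (DWORD)(viewOffset & 0xFFFFFFFF),
                                                    (SIZE_T)(skip + want));
            if (view == nullptr)
                return; // real code should report GetLastError()

            processWindow(view + skip, want); // scan the transactions in this window
            UnmapViewOfFile(view);            // release the view before mapping the next one

            pos += want;
        }
    }

With, say, 10 threads and 64 MiB windows, at most a few hundred MiB of the file is mapped at any moment, which gives the OS a much better chance of keeping each thread's working set in physical memory.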

Pepijn Kramer