
I need a single-writer, multiple-reader (up to 5) mechanism, written in C++, where the writer continuously pushes data packages of roughly 1 MB each at 15 packages per second. One thread keeps writing the data while 5 reader threads simultaneously run search operations on it, keyed by the timestamp of the data. Each data package has to be kept for 60 minutes; after that it can be removed from the container.

Since the data grows at 15 MB/s × 60 s × 60 min = 54,000 MB/h, I need roughly 50 GB of space to keep an hour of data and still make the operations fast enough for both the writer and the readers. But the thing is, we cannot keep data of that size in cache or RAM, so it has to live on a drive, preferably an SSD (an HDD would be too slow for that kind of operation).

What I've been considering so far is either a circular buffer (since I can calculate the maximum size) implemented directly on an SSD, although I haven't found a suitable example yet and I don't know whether that is even possible, or some kind of mapping mechanism: a circular array in RAM that keeps only the timestamps of the data and the physical (file) offsets of the packages stored on the drive, so that at least the search operations would be faster.
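
That index idea can be kept quite simple. Here is a minimal sketch (the names `IndexEntry`, `RingIndex`, and `find_at_or_after` are made up for illustration; it assumes packages arrive in timestamp order, so the retained window is always sorted):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One entry per ~1 MB package stored on the drive.
struct IndexEntry {
    std::uint64_t timestamp;   // timestamp carried by the package
    std::uint64_t file_offset; // byte offset of the package in the data file
};

// Fixed-capacity ring of index entries, kept in RAM by the writer.
class RingIndex {
public:
    explicit RingIndex(std::size_t capacity) : entries_(capacity) {}

    void push(const IndexEntry& e) {            // writer thread only
        entries_[head_] = e;
        head_ = (head_ + 1) % entries_.size();
        if (size_ < entries_.size()) ++size_;   // oldest entry is overwritten when full
    }

    // Find the first retained package with timestamp >= t. Linear scan for
    // brevity; because entries are sorted, a binary search over the two
    // contiguous halves of the ring is the obvious upgrade.
    bool find_at_or_after(std::uint64_t t, IndexEntry& out) const {
        for (std::size_t i = 0; i < size_; ++i) {
            std::size_t pos = (head_ + entries_.size() - size_ + i) % entries_.size();
            if (entries_[pos].timestamp >= t) { out = entries_[pos]; return true; }
        }
        return false;
    }

private:
    std::vector<IndexEntry> entries_;
    std::size_t head_ = 0, size_ = 0;
};
```

Concurrent access to such an index still needs the synchronization discussed in the answers below.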

Since any kind of lock, mutex, or semaphore will slow down the operations (the write path is especially critical; we cannot lose data because of any read operation), I don't want to use them. I know there are shared locks available, but I think they have their own drawbacks. Is there any way/idea to implement this kind of system so that it is lock-free, wait-free, and thread-safe as well? Any data structure (container), pattern, example code/project, or other kind of suggestion will be highly appreciated, thank you.

EDIT: Is there any idea other than a bigger amount of RAM?

user2955554
  • If the upper limit is really 50 GB, I'm pretty sure you can get a machine that handles that with ease. 128 GB is pretty standard for heavy workstations. – stijn Nov 19 '13 at 09:27
  • If your memory will be limited, then just keep the last few seconds in RAM, and flush the other to disk while using [memory-mapped files](http://en.wikipedia.org/wiki/Memory-mapped_file). – Some programmer dude Nov 19 '13 at 09:27
  • @JoachimPileborg keeping just a few seconds in RAM doesn't help, because the reader threads can demand data from any time frame, not just the most recent. – user2955554 Nov 19 '13 at 09:48

4 Answers


This can be done on a commodity PC (and can scale to a server without code changes).

Locks are not a problem. With a single writer and a few consumers that do time-consuming work on big blocks of data, you will have rare locking and practically zero lock contention, so it's a non-issue.
Anything from a simple spinlock (if you're really desperate for low latency) to, preferably, a pthread_mutex (which falls back to being a spinlock most of the time anyway) will do fine. Nothing fancy.

Note that you do not acquire a lock, receive a megabyte of data from a socket, write it to disk, and then release the lock. That's not how it works.
You receive a megabyte of data and write it to a region that you own exclusively, then acquire a lock, change a pointer (and thus transfer ownership), and release the lock. The lock protects the metadata, not every single byte in a gigabyte-sized buffer. Long-running tasks, short lock times, contention = zero.
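
A rough sketch of that hand-off (the names `Metadata` and `publish_block` are made up, and the setup of the mapped region is elided; see the mapping sketch further down):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <mutex>

constexpr std::size_t kBlockSize = 1 << 20;  // one ~1 MiB package per block

struct Metadata {                   // the only state the lock protects
    std::uint64_t begin = 0;        // oldest valid block number
    std::uint64_t end   = 0;        // one past the newest valid block number
};

std::mutex    meta_lock;
Metadata      meta;
char*         mapping    = nullptr; // base of the memory-mapped ring (set up elsewhere)
std::uint64_t num_blocks = 0;       // ring capacity in blocks

// Producer: all the expensive work happens outside the lock.
void publish_block(const char* data)
{
    static std::uint64_t next = 0;  // producer-local write position, no lock needed
    std::memcpy(mapping + (next % num_blocks) * kBlockSize, data, kBlockSize);
    ++next;                         // this block is outside [begin, end): no reader sees it yet

    std::lock_guard<std::mutex> g(meta_lock);  // held only to move two integers
    meta.end = next;
    if (meta.end - meta.begin > num_blocks - 2)
        ++meta.begin;               // retire the oldest block, keep spare blocks ahead
}
```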

As for the actual data, writing out 15 MiB/s is absolutely no challenge; a normal hard disk will do 5-6 times as much, and an SSD will easily do 10 to 20 times that. It also isn't something you even need to do yourself; it's something you can leave to the operating system to manage.

I would create a 54.1 GiB¹ file on disk and memory-map it (assuming it's a 64-bit system, a reasonable assumption when talking about multi-gigabyte-RAM servers, this is no problem). The operating system takes care of the rest. You just write your data to the mapped region, which you use as a circular buffer².
What was most recently written is more or less guaranteed³ to be resident in RAM, so the consumers can access it without faulting. Older data may or may not be in RAM, depending on whether your server has enough physical RAM available.
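
Setting the mapping up is only a handful of lines. A minimal POSIX sketch (error paths trimmed, `map_ring_file` is a made-up name; the Windows equivalents are linked in the comments below):

```cpp
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Create (or open) the backing file, grow it to the full size, map it read/write.
char* map_ring_file(const char* path, std::uint64_t bytes)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return nullptr;
    if (ftruncate(fd, static_cast<off_t>(bytes)) != 0) { close(fd); return nullptr; }

    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  // the mapping keeps the file alive
    return p == MAP_FAILED ? nullptr : static_cast<char*>(p);
}
```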

Data that is older can still be accessed, but likely at slightly slower speed (if there is not enough physical RAM to keep the whole set resident). It will however not affect the producer or the consumers reading the recently written data (unless the machine is so awfully low-spec that it can't even hold 2-3 of your 1MiB blocks in RAM, but then you have a different problem!).

You are not very specific about how you intend to process the data, other than that there will be 5 consumers, so I will not go too deep into this part. You may have to implement a job scheduling system, or you can just divide each incoming block into 5 smaller chunks, or whatever fits, depending on what exactly you want to do.

What you need to account for in any case is which region of your mapped ring buffer is "valid" and which is "unused" (kept either as pointers, or better as offsets into the mapping).
The producer owns the mapping, and it "allows" the consumers to access the data within the bounds given in the metadata (a begin/end pair of offsets). Only the producer may change this metadata, and anyone (including the producer) accessing it needs to acquire a lock.
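
Continuing the hypothetical sketch from above, the consumer side of that contract is tiny:

```cpp
#include <utility>

// Consumers copy the bounds under the lock, then search the mapped data
// without holding it; blocks in [begin, end) stay untouched long enough
// because the producer keeps spare blocks between the two ends.
std::pair<std::uint64_t, std::uint64_t> snapshot_bounds()
{
    std::lock_guard<std::mutex> g(meta_lock);
    return {meta.begin, meta.end};
}
```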

It is probably even possible to do this with atomic operations, but seeing how rarely you lock, I wouldn't bother. Using a lock is a no-brainer, and there are no subtle mistakes you can make that way.
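
For completeness, here is roughly what the atomic variant could look like with a single writer; the ordering subtleties below are exactly what the lock spares you from thinking about:

```cpp
#include <atomic>
#include <cstdint>
#include <utility>

std::atomic<std::uint64_t> a_begin{0}, a_end{0};  // replaces Metadata + mutex

// Producer, after filling the block outside the valid range (num_blocks as above):
void publish_atomic(std::uint64_t num_blocks)
{
    a_end.fetch_add(1, std::memory_order_release);     // make the new block visible
    if (a_end.load(std::memory_order_relaxed) -
        a_begin.load(std::memory_order_relaxed) > num_blocks - 2)
        a_begin.fetch_add(1, std::memory_order_release);  // retire the oldest block
}

// Consumer: loading end before begin keeps the snapshot conservative,
// since both counters only ever grow and begin never passes end.
std::pair<std::uint64_t, std::uint64_t> snapshot_atomic()
{
    std::uint64_t e = a_end.load(std::memory_order_acquire);
    std::uint64_t b = a_begin.load(std::memory_order_acquire);
    return {b, e};
}
```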

Since the producer knows that the consumers will only look at data within well-defined bounds, it can write to areas outside those bounds (the area known to be "empty") without locking. It only needs to take the lock afterwards, to change the bounds.

As 54.1 GiB > 54 GiB, you have a hundred spare 1 MiB blocks in the mapping that you can write to. That's probably much more than needed (2 or 3 should do), but it doesn't hurt to have a few extra. As you write to a new block (and increase the valid range by 1), also adjust the other end of the "valid range". That way, threads will no longer be allowed to access an old block, but a thread still working in that block can finish its work (the data still exists).
If one is strict about correctness, this may create a race condition if processing a block takes extremely long (longer than it takes the producer to cycle through the spare blocks; with 100 spare blocks arriving at 15 per second, that is just under 7 seconds in this case). If you want to be absolutely sure, you'll need another lock, which may in the worst case block the producer. That's something you absolutely didn't want, but blocking the producer in the worst case is the only thing that is 100% correct in every contrived case, unless a hypothetical computer has unlimited memory.
Given the situation, I think this theoretical race is an "allowable" thing. If processing a single block really takes that long with so much data steadily coming in, you have a much more serious problem at hand, so in practice it's a non-issue.

If your boss decides at some point in the future that you should keep more than 1 hour of backlog, you can enlarge the file and remap; when the "empty" region next reaches the end of the old buffer, simply extend the "known" file size and adjust the max_size value in the producer. The consumer threads don't even need to know. You could of course create another file, copy the data, swap, and keep the consumers blocked in the meantime, but I consider that an inferior solution: a size increase probably doesn't need to become visible immediately, whereas it is highly desirable that growing is an "invisible" process.
If you put more RAM into the computer, your program will "magically" use it, without you needing to change anything. The operating system will simply keep more pages in RAM. If you add another few consumers, it will still work the same.


¹ Intentionally bigger than what you need, let there be a few "extra" 1 MiB blocks.

² Preferably, you can madvise the operating system (if you use a system that has a destructive MADV_DONTNEED hint, such as Linux) that you are no longer interested in the contents before overwriting a region. But even if you don't, it will work either way, only slightly less efficiently, because the OS will possibly do a read-modify-write operation where a write operation would have been enough.
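
On Linux that hint is a single call (sketch; `retired_offset`, the byte offset of the block being retired, is a made-up name and must be page-aligned, which 1 MiB blocks are):

```cpp
#include <cstddef>
#include <sys/mman.h>

// Ask the kernel to drop the cached pages of a block we will overwrite anyway,
// so it does not bother faulting the old contents back in first.
void drop_retired_block(char* mapping, std::size_t retired_offset)
{
    madvise(mapping + retired_offset, 1 << 20, MADV_DONTNEED);
}
```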

³ There is of course never really a guarantee, but it's what will be the case anyway.

Damon
  • Thank you very much for your detailed reply. But since the explanation is quite abstract, I'm confused on some points about implementing it. Do you know of any example implementation, document, article, code sample etc. to make it more concrete? – user2955554 Nov 19 '13 at 14:53
  • Finding an example implementation will likely be hard, since it's a rather specific problem. But you will likely find examples for most of the components you need in your implementation. Memory mapping is somewhat system-specific and somewhat low-level, so one would need to know at least what operating system, or what class of operating system (e.g. Linux, BSD, Windows Server), you intend to run this on. I've kind of assumed that "server" == "POSIX" when I said e.g. "pthread_mutex", but you can of course generally do the same stuff on every operating system. – Damon Nov 19 '13 at 15:03
  • Generally, the approach is dead simple. You use a ringbuffer that is somewhat larger than the 1-hour lookback, and you protect access to the ringbuffer's metadata (the begin/end fields) with a mutex. The only other thing that is "special" is that you place it into a memory-mapped region, which requires using an operating system specific call (such as mmap for POSIX and POSIX-like systems, or CreateFileMapping/MapViewOfFile for Windows systems). These are well-documented, but as mentioned above, one would first need to know what system you target. – Damon Nov 19 '13 at 15:06
  • The OS will definitely be Windows, most probably 7 x64. I'm not very familiar with memory mapping, so researching that would be a good starting point I guess. What about searching for a data package according to the timestamp attribute (which comes with the data itself)? Is it easy to implement this kind of search mechanism over the items? Any tips for that? – user2955554 Nov 19 '13 at 15:18
  • For Windows, you'll need to know these functions for memory mapping: [1](http://msdn.microsoft.com/en-us/library/windows/desktop/aa363858%28v=vs.85%29.aspx), [2](http://msdn.microsoft.com/en-us/library/windows/desktop/aa366537%28v=vs.85%29.aspx), [3](http://msdn.microsoft.com/en-us/library/windows/desktop/aa366761%28v=vs.85%29.aspx) as well as this one [4](http://msdn.microsoft.com/en-us/library/windows/desktop/ms683472%28v=vs.85%29.aspx) and the ones mentioned in "Remarks" for the spinning mutex if you don't have C++11 thread functions (a minimal sketch of the mapping calls follows after these comments). – Damon Nov 19 '13 at 15:30
  • About your timestamp attribute, that's impossible to answer since I don't know anything about what your data looks like. Is there a timestamp coming with each 1MiB block, or are there many timestamps on smaller entities within each block, is every block separate or are they one "huge stream", etc etc. But in general, this shouldn't matter a lot for the problem of receiving, storing, and accessing the data. This is more about what you do with your data. – Damon Nov 19 '13 at 15:33
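
For reference, a minimal sketch of those Windows calls chained together (error handling trimmed, `map_ring_file_win` is a made-up name; the 64-bit size is passed as two 32-bit halves):

```cpp
#include <windows.h>
#include <cstdint>

// Create the backing file, size it via the mapping object, map it read/write.
char* map_ring_file_win(const wchar_t* path, std::uint64_t bytes)
{
    HANDLE file = CreateFileW(path, GENERIC_READ | GENERIC_WRITE, FILE_SHARE_READ,
                              nullptr, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE) return nullptr;

    HANDLE mapping = CreateFileMappingW(file, nullptr, PAGE_READWRITE,
                                        static_cast<DWORD>(bytes >> 32),         // size, high half
                                        static_cast<DWORD>(bytes & 0xFFFFFFFFu), // size, low half
                                        nullptr);   // also extends the file to 'bytes'
    CloseHandle(file);                              // the mapping object keeps it alive
    if (!mapping) return nullptr;

    void* p = MapViewOfFile(mapping, FILE_MAP_ALL_ACCESS, 0, 0, 0);  // map the whole file
    CloseHandle(mapping);                           // the view keeps the mapping alive
    return static_cast<char*>(p);
}
```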

54 GB/hour = 15 MB/s. A good SSD these days can write 300+ MB/s. If you keep 1 hour of data in RAM and occasionally flush older data to disk, you should be able to handle 10× more than 15 MB/s (provided your search algorithm is fast enough to keep up).

Regarding a fast locking mechanism between your threads, I would suggest looking into RCU (Read-Copy-Update). The Linux kernel currently uses it to achieve very efficient locking.
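
Kernel-style RCU is not directly available to user-space C++, but the read-copy-update idea can be approximated with `shared_ptr` for the metadata/index (not for the 1 MB payloads themselves, which are too big to copy on every update). A minimal sketch:

```cpp
#include <memory>
#include <vector>

// Readers grab an immutable snapshot; the writer publishes a modified copy.
// shared_ptr's reference count plays the role of RCU's deferred reclamation.
struct Snapshot { std::vector<int> index; };

std::shared_ptr<const Snapshot> g_snap = std::make_shared<const Snapshot>();

std::shared_ptr<const Snapshot> rcu_read()   // readers never block the writer
{
    return std::atomic_load(&g_snap);
}

void rcu_update(int new_entry)               // single writer
{
    auto next = std::make_shared<Snapshot>(*std::atomic_load(&g_snap)); // copy
    next->index.push_back(new_entry);                                   // update
    std::atomic_store(&g_snap, std::shared_ptr<const Snapshot>(std::move(next)));
}
```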

mvp
  • Actually the system should work for different sizes and frequencies; the data per second can be bigger in other deployments, so it must be generic. But you're right, SSDs are fast enough to fit the bill. Still, I'm not sure an SSD can be used like RAM, where I could implement any kind of container structure and run all the read and write operations directly on it. And since 1 hour of data will be over 50 GB, it is impossible to keep it in RAM, right? – user2955554 Nov 19 '13 at 09:42
  • You can find motherboards that [support 128GB RAM](http://www.newegg.com/Product/Product.aspx?Item=N82E16813131861). It does not come cheap, but your total computer should not cost more than $2000. – mvp Nov 19 '13 at 09:51
  • @mvp: actually, heavyweight database servers often flirt with 1 TB/2 TB nowadays (certainly not cheap, but man-days of work are not cheap either). – Matthieu M. Nov 19 '13 at 10:11

Do you have any minimum hardware requirements? 54 GB in memory is perfectly possible these days (many motherboards can take 4×16 GB, and that's not even server hardware). So instead of requiring an SSD, you could maybe just as well require a lot of RAM and use an in-memory circular buffer, as you suggest.

Also, if there is sufficient redundancy in the data, it may be viable to use some cheap compression algorithm (one that is easy on the CPU, i.e. some sort of 'level 0' compression). That is, you don't store the raw data, but some compressed format (and possibly some index) which is decompressed by the readers.
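
LZ4 is a typical choice for this kind of CPU-cheap compression (the answer names no specific library, so this is just one possibility). A minimal sketch of compressing a package before it is written out:

```cpp
#include <lz4.h>   // liblz4
#include <vector>

// Compress one ~1 MB package; returns the compressed size, or 0 on failure.
int compress_package(const char* src, int src_size, std::vector<char>& dst)
{
    dst.resize(LZ4_compressBound(src_size));          // worst-case output size
    int n = LZ4_compress_default(src, dst.data(), src_size,
                                 static_cast<int>(dst.size()));
    if (n > 0) dst.resize(n);                         // shrink to the actual size
    return n;
}
```

The readers would then use `LZ4_decompress_safe` on the stored bytes, and the timestamp index would point at compressed offsets.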

Frerich Raabe
  • There are no such strict hardware requirements, but I cannot just count on the RAM, because the system will be modified further for different, possibly bigger data sizes and higher frequencies. So using an HDD, or at least an SSD, is sufficient. – user2955554 Nov 19 '13 at 09:46
  • @user2955554: 64GB of RAM is something that a commodity PC can handle. At the company I work at we have some heavyweight database servers with 512GB or 1TB of RAM. It will probably be cheaper (long-term) to have a straightforward implementation and more RAM. – Matthieu M. Nov 19 '13 at 10:09

Many good recommendations here already. I'd just like to add that, for the circular buffer implementation, you can have a look at Boost Circular Buffer.
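
A minimal sketch of `boost::circular_buffer` used for the in-RAM index side of this problem (holding small index entries rather than the 1 MB packages themselves, which would not fit in memory):

```cpp
#include <boost/circular_buffer.hpp>
#include <cstdint>

struct IndexEntry { std::uint64_t timestamp, file_offset; };

int main()
{
    // Fixed capacity: one entry per package for a 60-minute window.
    boost::circular_buffer<IndexEntry> ring(15 * 60 * 60);

    ring.push_back({1384854000000u, 0u});        // once full, push_back
    ring.push_back({1384854000066u, 1u << 20});  // overwrites the oldest entry

    // Entries arrive in timestamp order, so ring.front() is the oldest
    // retained package and ring.back() is the newest.
}
```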

Andriy Tylychko