
I have created an application that is sold to customers, some of which are hardware manufacturers with fixed constraints (a slow CPU). The app has to be in Java so that it can be easily installed as a single package.

The application is multithreaded and maintains audio records. In this particular case all we have is INSERT SOMEDATA FOR RECORD, each record representing an audio file (and this can be done by different threads), then later on we have SELECT SOMEDATA WHERE IDS IN (x, y, z) from a single thread, and the 3rd step is that we actually DELETE all the data in this table.
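To make that concrete, here is a stripped-down JDBC sketch of the current pattern against H2 (the table and column names are placeholders, not my real schema):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class CurrentPattern {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./recordstore")) {
            conn.createStatement().execute(
                "CREATE TABLE IF NOT EXISTS record (id BIGINT PRIMARY KEY, somedata VARBINARY(1024))");

            // Step 1: INSERT, potentially from several threads
            try (PreparedStatement ins = conn.prepareStatement(
                    "INSERT INTO record (id, somedata) VALUES (?, ?)")) {
                ins.setLong(1, 1L);
                ins.setBytes(2, new byte[] {42});
                ins.executeUpdate();
            }

            // Step 2: SELECT a batch of rows by id, from a single thread
            try (PreparedStatement sel = conn.prepareStatement(
                    "SELECT id, somedata FROM record WHERE id IN (?, ?, ?)")) {
                sel.setLong(1, 1L);
                sel.setLong(2, 2L);
                sel.setLong(3, 3L);
                try (ResultSet rs = sel.executeQuery()) {
                    while (rs.next()) {
                        byte[] data = rs.getBytes("somedata"); // process the record
                    }
                }
            }

            // Step 3: throw everything away
            conn.createStatement().execute("DELETE FROM record");
        }
    }
}
```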

The main constraint is the CPU: a slow, single CPU. Memory is also a constraint, but only in that the application is designed to process an unlimited number of files, so even with lots of memory it would eventually run out if everything were stored in memory rather than on disk.

In my Java application I started off using the H2 database to store all my data. But the software has to run on some slow single-CPU servers, so I want to reduce the CPU cycles used, and one area I want to look at again is the database.

In many cases I am inserting data into the database simply to keep it off the heap, since otherwise I would run out of memory; later on we retrieve the data. We never have to UPDATE the data.

So I considered using a cache like Ehcache, but that has two problems:

  • It doesn't guarantee the data will not be thrown away (if the cache gets full)
  • I can only retrieve records one at a time, whereas with a relational database I can retrieve a batch of records; this looks like a potential bottleneck.

What is an alternative that solves these issues?

Paul Taylor
  • What will happen if you store the data in Apache Solr, or simply use Lucene to retrieve the data based on a query? Of course it is not the relational-DB approach. – Sambit May 20 '20 at 07:02
  • @Sambit it's an idea I hadn't considered, but it seems a bit more complex than I was envisaging, and Lucene queries are usually for picking top hits rather than all hits – Paul Taylor May 20 '20 at 08:53
  • How would you design such a solution yourself? You need disk storage with some sort of indexing. Sounds exactly like some lightweight RDBMS. – Janez Kuhar May 20 '20 at 09:14
  • Looking at Ehcache, I could resolve issue 1 by making a very large disk cache, but I am still concerned there is no way to retrieve records efficiently. So if I want to retrieve 50,000 I would have to call one by one. But the big advantage over a relational database is no locking when I write the records, because if thread A is working on records 1-5 I can guarantee no other thread will be. So a relational database seems like overkill – Paul Taylor May 27 '20 at 16:19
  • Some possible candidates [here](https://stackoverflow.com/questions/22186175/java-embedded-library-on-disk-key-value-database). – Olivier May 27 '20 at 16:39
  • Thank you, MapDB does sound like the kind of thing I was looking for, I will investigate further. – Paul Taylor May 28 '20 at 06:46
  • It's not clear from what you're describing that anything's really wrong with H2 for this task. Alternatively, why not just write the data to the filesystem yourself? – dimo414 May 31 '20 at 06:45
  • @dimo414 When running on a low-powered single-CPU machine there is a lot of memory (e.g. prepared-statement caching) and CPU overhead in having H2 running; I think it could be replaced by something much simpler. – Paul Taylor May 31 '20 at 06:53
  • And just writing to flat files on-disk isn't an option because? – dimo414 May 31 '20 at 07:45
  • Well, for the same reason that people use caching solutions such as Ehcache alongside a database: it's slower to just read and write to disk all the time. Especially when reading a file with many records, if I want to retrieve a particular record it would have to scan the file from the start looking for that record, so it would be incredibly slow. – Paul Taylor May 31 '20 at 08:56

4 Answers


You want to retrieve records in batches quickly, you can't afford to lose any data, but you don't need optimized queries or updates, and you want to use CPU and memory resources as effectively as possible:

Why don't you simply store your records in a file? The operating system uses any free memory for caching. So when you access your file frequently, the OS will do its best to keep as much of its content as possible in memory. The OS does this job anyway, so this type of caching costs you no additional CPU and not a single line of code.
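As a minimal sketch of that idea (all names are illustrative; it assumes each record is a byte array keyed by a numeric id): append every record to a single file and keep only a small id-to-offset map on the heap, letting the OS file cache keep the hot parts in RAM.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

// Append-only record file: data lives on disk, only a small
// id -> (offset, length) index stays on the heap. The OS file
// cache keeps recently used parts of the file in RAM for free.
public class RecordFile {
    private final RandomAccessFile file;
    private final Map<Long, long[]> index = new HashMap<>();

    public RecordFile(String path) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
    }

    public synchronized void insert(long id, byte[] data) throws IOException {
        long offset = file.length();
        file.seek(offset);
        file.write(data);
        index.put(id, new long[] {offset, data.length});
    }

    public synchronized byte[] read(long id) throws IOException {
        long[] entry = index.get(id); // null check omitted in this sketch
        byte[] data = new byte[(int) entry[1]];
        file.seek(entry[0]);
        file.readFully(data);
        return data;
    }
}
```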

The only scenarios where it could make sense to invest more in optimization would be:

  • a) Your process or other processes make heavy use of the file system and pollute the file cache
  • b) Serialization / deserialization is too expensive

In case of a):

Define your priorities. An explicit cache (on-heap or off-heap) can help you keep some content of selected files in memory. But this memory will no longer be available for the OS's file cache. So while you speed up access to one file, you potentially slow down access to other files.

In case of b):

Measure performance first, before you optimize anything. Usually disk access is the bottleneck; that's something you cannot change without replacing hardware. If you still want to optimize (e.g. because GC eats up CPU due to a very high number of temporarily created objects; I guess with only one core the serial GC will be in use), then I suggest having a closer look at Google FlatBuffers.

You started with the most complex solution for your problem, a database. I suggest starting at the other end of the spectrum and keeping it as simple as possible.


UPDATE: The question has been edited in the meantime and the requirements have changed. A new requirement is that it must be possible to read selected records by ID.

Possible extensions:

  • Store each record in its own file and use the key as the file name
  • Store all records in one file and use a file-based hash map such as MapDB's HTreeMap (see the sketch below)

Independent of the chosen extension, the operating system's file cache will do its best to hold as much content as possible in main memory.
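A minimal sketch of the second extension, assuming the MapDB 3.x API (the file name, map name, and serializers are just examples):

```java
import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.HTreeMap;
import org.mapdb.Serializer;

public class MapDbSketch {
    public static void main(String[] args) {
        // One file on disk; MapDB pages entries in and out as needed,
        // so heap usage stays bounded no matter how many records are written
        try (DB db = DBMaker.fileDB("records.db").fileMmapEnableIfSupported().make()) {
            HTreeMap<Long, byte[]> records = db
                    .hashMap("records", Serializer.LONG, Serializer.BYTE_ARRAY)
                    .createOrOpen();

            records.put(1L, new byte[] {42}); // INSERT, from any thread
            byte[] data = records.get(1L);    // SELECT by id
            records.clear();                  // DELETE everything at the end
        }
    }
}
```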

rmunge
  • 'You started with the most complex solution for your problem, a database. I suggest starting at the other end of the spectrum and keeping it as simple as possible', yes I agree with you in principle. But because I need to read and WRITE data, the OS is not going to be able to keep the data in memory. I think I just need to replace the database with a cache, but I'm looking for a cache that will let me retrieve records in batch (keyed by ids but not necessarily together in the file), and I still don't have that part of the puzzle. – Paul Taylor Jun 03 '20 at 06:49
  • Why should the OS not be able to cache because of a WRITE? What do you mean by "keyed by ids but not necessarily together in the file"? You stated that you don't need queries and that you want to retrieve records in batch. – rmunge Jun 03 '20 at 17:53
  • Every time the data is written, that invalidates any existing OS cache. Each record has an ID, i.e. 1, 2, 3, 4, 5, 6, but the records will not be written to the file in a specific order, and I may need to retrieve 2, 4, 6 together, for example. – Paul Taylor Jun 04 '20 at 06:38
  • As long as the probability that the last-written records are read is not lower than the probability that already-written records are read, a write should not have any negative impact. If the probabilities differ, please share the heuristics behind this. That would be important information for the right choice of cache implementation. – rmunge Jun 04 '20 at 08:43
  • Thanks for the MapDB link, I think this is what I need to investigate next. – Paul Taylor Jun 04 '20 at 09:24
  • Maybe this would work well: https://www.dizitart.org/nitrite-database.html – Paul Taylor Jun 04 '20 at 16:02
  • Just noticed this uses the H2 key-value store; since I'm using H2, maybe I should go down the route of using MVStore directly instead of SQL – Paul Taylor Jun 04 '20 at 16:04

Some ideas that might help:

  1. You say that you're running on a single CPU and want to check a substitute for H2. So, H2 "consumes" a lot of CPU power and the application is claimed to be "slow". But what if it's because of a slow disk, not the CPU? After all, databases store their stuff on disks, and disks can be slow. If you want to check this theory, map the database directory to some RAM-backed drive (in Linux it's an easy task) and measure again with the same CPU.

  2. If you come to the conclusion that H2 is indeed too CPU-intensive for your use cases, maybe it's worth investing some time in optimizing the queries; this is much cheaper than replacing the database.

  3. Now, if you can't stay with H2, consider Lucene, which is really optimized for this "append-only" use case (I understand that you have an "append-only" flow because you said "later on we retrieve the data, we never have to UPDATE the data"). Having said that, Lucene also has its own threads that handle indexing, so some CPU overhead is expected anyway. However, chances are that Lucene will be faster for this use case. The price is that you won't get "easy" queries, because Lucene doesn't implement the relational model (maybe partially because of that it will be faster); in particular you won't have JOINs or transaction management. It's still possible to query by conditions on a single table as in an RDBMS, and you don't have to fetch only "top hits" as you describe. See the sketch after this list.
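Here is the rough sketch mentioned in point 3, assuming a recent Lucene (8.x; the field names are placeholders, and TermInSetQuery fetches a whole batch of ids in one query):

```java
import java.nio.file.Paths;
import java.util.Arrays;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermInSetQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class LuceneSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("audio-index"));

        // Append-only indexing; IndexWriter is safe to share between threads
        try (IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("id", "1", Field.Store.NO));    // indexed key
            doc.add(new StoredField("somedata", new byte[] {42}));  // stored payload
            writer.addDocument(doc);
        }

        // Retrieve a whole batch of ids with a single query
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TermInSetQuery query = new TermInSetQuery("id",
                    Arrays.asList(new BytesRef("1"), new BytesRef("2")));
            TopDocs hits = searcher.search(query, 2);
            for (ScoreDoc sd : hits.scoreDocs) {
                BytesRef data = searcher.doc(sd.doc).getBinaryValue("somedata");
                // data.bytes[data.offset .. data.offset + data.length) is the record
            }
        }
    }
}
```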

Mark Bramnik
  • CPU AND disk are slow; that is one reason why I wouldn't just write the data to a file and read it back. In this particular case all we have is INSERT SOMEDATA FOR RECORD, each record representing an audio file (and this can be done by different threads), and then later on we have SELECT SOMEDATA WHERE IDS IN (x, y, z) (this is just one thread), so I don't need any complex queries, I just want to be able to return a number of rows by ID; then the 3rd step is we actually DELETE all the data in this table. I've used Lucene before for indexing data and this doesn't really sound like a good fit for it to me. – Paul Taylor May 20 '20 at 11:23

From your question and the comments made on Mark Bramnik's answer I understood this:

  • CPU constraint: very slow CPU, the solution should not be CPU-intensive
  • Memory constraint: not all data can fit in memory
  • Disk constraint: very slow disk, the solution should not read/write lots of data from disk

These are very strict constraints. Usually you "trade" CPU vs memory or memory vs disk. In your case all of them are constrained. You mentioned you looked at Ehcache; however, I think that solution (and possibly others, such as Memcached) is not more lightweight than H2.

One solution you could try is MappedByteBuffer. This class makes it possible to have parts of a file in memory and will swap those parts in and out when needed. But this comes at a cost: it is not an easy beast to tame. You will need to write your own algorithm to locate the data you need. Please weigh the time it will take you to get this working against the additional cost of a bigger machine. Sometimes better hardware is the solution.
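To give an idea of the work involved, here is a minimal sketch, assuming fixed-size records addressed by a dense numeric id (both assumptions that real code would have to relax):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Fixed-size slots in a memory-mapped file: the OS pages data in
// and out, so the file never has to fit on the heap all at once.
public class MappedStore {
    private static final int SLOT_SIZE = 1024; // assumed maximum record size
    private final MappedByteBuffer buffer;

    public MappedStore(String path, int maxRecords) throws IOException {
        FileChannel channel = FileChannel.open(Paths.get(path),
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE);
        buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, (long) maxRecords * SLOT_SIZE);
    }

    public void write(int id, byte[] record) { // record.length <= SLOT_SIZE
        ByteBuffer slice = buffer.duplicate(); // independent position per caller
        slice.position(id * SLOT_SIZE);
        slice.put(record);
    }

    public byte[] read(int id, int length) {
        byte[] out = new byte[length];
        ByteBuffer slice = buffer.duplicate();
        slice.position(id * SLOT_SIZE);
        slice.get(out);
        return out;
    }
}
```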

Wouter Oet
  • The main constraint is CPU, a slow single CPU. Memory is a constraint, but only in that the application is designed so it can process an unlimited number of files, and so even if it had lots of memory it would eventually run out if everything were stored in memory rather than utilizing the disk. I wasn't saying that disk access is particularly slow for this device, just that disk access is significantly slower than RAM access for any device. I would say Ehcache is much simpler than H2, but the big problem with it for me is there is no way to retrieve multiple records in one go, otherwise it would meet my needs. – Paul Taylor Jun 01 '20 at 07:22

Relational databases like Oracle are decades old (41 years); can you imagine how many CPU cycles were available back then? They are based on research from 1970 and well understood by professionals: tested, documented, reliable, consistent (checksums), maintainable (backups with zero data loss), performant if used correctly (all kinds of indexes), accessible securely over the network, scalable, etc. But apparently Not Invented Here.

Nowadays there are even many free open-source databases like PostgreSQL that have very modest requirements, offer the potential to easily implement new requirements in the future (which are hard to predict), and with some effort are interchangeable with other databases (JDBC, JPA).

Yes, there is some overhead, but typically hardware is cheaper than changing your architecture late in the project, and CPU cycles are not an expensive resource anymore (think Raspberry Pi, smartphones, etc.).

JohannesB
  • Just to be clear, this is software sold to consumers, so I can't buy their hardware for them. And when I refer to a slow single-CPU server, I mean the software is also provided to a hardware manufacturer of a black-box device, hence the hardware constraints are fixed; I have no say in them, and have to do the best I can with them. – Paul Taylor Jun 01 '20 at 07:16
  • Also, the solution has to be pure Java, so it can be easily installed as a single package. The fact is databases are complex beasts, and that comes at a cost, and I don't need all that complexity, so I was looking for something else. MapDB does sound like the kind of thing I want, but I would prefer not to use a one-man project. – Paul Taylor Jun 01 '20 at 07:18