
It is known that database indexing only makes sense if you have large tables and more reads than writes, since maintaining the indices adds write overhead: each modification of the database also requires a modification of the indices.

If we assume that the indexing structure of the database is (a) a B+ tree or (b) a hashtable, what is a rule of thumb for the ratio of reads to writes at which it starts to make sense to implement database indexing?

For information on how database indexing works, check out "How does database indexing work?"

scs
    This can't be answered. If there is one really important query that benefits from indexing, it might be worth taking the hit when writing data, regardless of how big that slowdown is. It also depends on the DBMS being used. –  Dec 27 '16 at 09:01

3 Answers


There are many factors involved here. For example: Is the index data in the cache? Are the data blocks in the cache? How many rows does the query retrieve? How many rows are in the table? And so on. I have been asked this question (or a variant of it) many times. Rather than give a number, I prefer to give some information so that the person asking can make an informed decision. Here is a "back of the envelope" calculation with some stated assumptions.

  1. Let's say the table has 100M rows.
  2. Let's say a row is 200 bytes on average.
    • That means the table consumes about 19 GB (ignoring any compression).
  3. Let's assume that the index is fully cached, so index lookups take zero time.
  4. Let's assume that the data has to be read from disk, at 5 ms per read.
  5. Let's assume that your I/O subsystem can deliver data at 10 GB/s.
  6. Let's assume that you need to read 50,000 rows from your table, and that the rows are spread out such that two of the required rows live in the same block.

    OK, now we can do some math.

    Scenario 1. Using an index

    • Index reads are zero time
    • Data block reads: 25,000 blocks @ 5 ms each is 125 s

    Scenario 2. Scanning the table

    • 19 GB scanned at 10 GB/s is 2 s

    Therefore in this case, scanning the data is much faster than using an index.
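For anyone who wants to vary the assumptions, here is a minimal Python sketch of the same arithmetic (the constants are the assumptions stated above; the break-even estimate at the end is an extra step I added, not part of the original answer):

```python
# Back-of-the-envelope: index lookups vs. full table scan.
# All constants are the stated assumptions, not measurements.
ROWS = 100_000_000        # table size
ROW_BYTES = 200           # average row size
SEEK_S = 0.005            # 5 ms per random block read
SCAN_BPS = 10e9           # 10 GB/s sequential throughput
ROWS_NEEDED = 50_000      # rows the query retrieves
ROWS_PER_BLOCK = 2        # required rows sharing one block

table_bytes = ROWS * ROW_BYTES                          # ~19 GB

# Scenario 1: index (the index itself is fully cached, i.e. free)
index_time = (ROWS_NEEDED / ROWS_PER_BLOCK) * SEEK_S    # 25,000 reads -> 125 s

# Scenario 2: full table scan
scan_time = table_bytes / SCAN_BPS                      # -> 2 s

print(f"index: {index_time:.0f} s, scan: {scan_time:.0f} s")

# Break-even: the index only wins while its random reads cost less than one
# full scan, i.e. below scan_time / SEEK_S = 400 blocks (~800 rows) fetched.
print(f"index pays off below ~{scan_time / SEEK_S * ROWS_PER_BLOCK:.0f} rows")
```

So under these assumptions the index stops paying off once the query touches more than a tiny fraction (here about 0.0008%) of the table.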

BobC

Cost formulas

No index:

  • insert: O(1)
  • search: O(n)
  • cost search = n db lookup operations
  • cost insert = 1 db insert

a) B+ tree of order b

  • insert: O(log n)
  • search: O(log n)
  • cost search = log_b n lookup operations + 1 db lookup
  • cost insert = log_b n lookup operations + 1 db insert
  • with increasing order, the number of lookup operations goes down, but the cost per lookup operation increases

b) Hashtable

  • insert: O(1)
  • search: O(1)
  • cost search = hash calculation [+ extra buckets when collisions happen] + 1 db lookup
  • cost insert = hash calculation [+ extra buckets when collisions happen] + 1 db insert

Cost at n = 1000

No index:

  • cost search = 1000 db lookup operations
  • cost insert = 1 db insert

a1) B+ tree of order 2

  • cost search = 10 lookup operations + 1 db lookup
  • cost insert = 10 lookup operations + 1 db insert

a2) B+ tree of order 10

  • cost search = 3 lookup operations + 1 db lookup
  • cost insert = 3 lookup operations + 1 db insert

b) Hashtable

  • cost search = hash calculation [+ extra buckets when collisions happen] + 1 db lookup
  • cost insert = hash calculation [+ extra buckets when collisions happen] + 1 db insert

Cost at n = 1000000

No index:

  • cost search = 1000000 db lookup operations
  • cost insert = 1 db insert

a1) B+ tree of order 2

  • cost search = 20 lookup operations + 1 db lookup
  • cost insert = 20 lookup operations + 1 db insert

a2) B+ tree of order 10

  • cost search = 6 lookup operations + 1 db lookup
  • cost insert = 6 lookup operations + 1 db insert

b) Hashtable

  • cost search = hash calculation [+ extra buckets when collisions happen] + 1 db lookup
  • cost insert = hash calculation [+ extra buckets when collisions happen] + 1 db insert
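As a sanity check, the lookup counts above can be reproduced with a few lines of Python, assuming the number of node lookups in a B+ tree of order b is ceil(log_b n):

```python
def btree_lookups(n: int, order: int) -> int:
    # Smallest height h with order**h >= n, i.e. ceil(log_order(n)).
    height = 1
    while order ** height < n:
        height += 1
    return height

for n in (1_000, 1_000_000):
    print(f"n = {n}:")
    print(f"  no index:         {n} db lookup operations")
    for order in (2, 10):
        print(f"  B+ tree order {order:2d}: {btree_lookups(n, order)} lookups + 1 db lookup")
    print( "  hashtable:        1 hash calculation + 1 db lookup")
```

This prints 10 and 3 lookups for n = 1000, and 20 and 6 for n = 1,000,000, matching the numbers above.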

There are also quite large cost gains to be had from hitting the hardware caches on subsequent reads. The exact gains depend on the hardware your database runs on.

The costs are not all directly comparable. The hash calculation is generally more expensive than a single tree lookup, but its cost stays constant as n grows. B+ tree lookups and sequential db lookups are likely to hit hardware caches (because entire blocks are loaded at once).

The larger a table, the more important it is to have an index (compare the cost at n = 1000 vs. n = 1000000). So the read/write ratio at which an index pays off varies with the size of your table.

You should also take your specific queries into account. For example, hash tables are not ordered, while B+ trees are. So if a query needs to pick up all values between a minimum and a maximum, a B+ tree will perform better than a hash table (a good hash is uniformly distributed, so it scatters consecutive keys).
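As a toy illustration, here is a Python sketch in which a sorted list plus binary search stands in for a B+ tree's sorted leaf level (the keys are made up):

```python
import bisect

keys = sorted([3, 4, 8, 15, 17, 23, 42])   # stand-in for sorted index leaves

def range_query(lo, hi):
    # Two binary searches bound the range, then it is read sequentially;
    # this is essentially what a B+ tree range scan does.
    return keys[bisect.bisect_left(keys, lo):bisect.bisect_right(keys, hi)]

print(range_query(5, 20))   # [8, 15, 17]

# A hash table offers no such shortcut: hashing scatters consecutive keys
# across unrelated buckets, so every candidate key would have to be probed.
```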

In general, you have to measure the performance of the queries and inserts you will be using in practice. You can start without an index and add one when you need it. Indexes can be added later without changing the queries of the programs using your db (the response times will change, though: reads that use the index get faster, writes get slower).

If you are in the specific case of a load operation with a lot of inserts followed by a period of reads, you can temporarily drop the index and rebuild it after the load.
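A minimal sketch of that pattern, using SQLite from Python (the table and index names are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (id INTEGER, payload TEXT)")
con.execute("CREATE INDEX idx_events_id ON events (id)")

# Bulk load: drop the index so inserts no longer pay for index maintenance.
con.execute("DROP INDEX idx_events_id")
con.executemany("INSERT INTO events VALUES (?, ?)",
                ((i, f"payload {i}") for i in range(100_000)))

# Rebuild the index in a single pass before the read-heavy period starts.
con.execute("CREATE INDEX idx_events_id ON events (id)")
con.commit()
```

Some databases also offer bulk-load utilities or an option to disable index maintenance during a load, which achieves the same effect without dropping the index.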


References:

  • B-trees vs. hash tables: https://cs.stackexchange.com/questions/270/hash-tables-versus-binary-trees
  • Memory hierarchy and caches: http://www.moreprocess.com/devices/computer-memory-hierarchy-internal-register-cache-ram-hard-disk-magnetic-tape

HSquirrel
  • Reason why this question only got an upvote instead of a bounty: complexity analysis is not always helpful for deciding whether it makes sense to implement an index. In some cases it is faster to read all the data sequentially. In addition, the complexity analysis of B-trees should mention the O(log2(n)) search for an entry within a node. – scs Jan 03 '17 at 16:08

The point of using a database is usually to cut the time spent looking for something in the data, so it is mostly about reading. I am quite sure that 90% of people use a database for reading, if not 100%.

Let's think through a few cases:

  • Writes only, no reads: what's the point? Just for backup? Well, the backup will be used for reads once it is restored.
  • Many writes, few reads: I can't think of such a situation offhand, but when you do have one, which do you value more, presenting the data swiftly or saving it fast? If you value how fast the data is saved, you can remove the index and run a nightly job to summarize the data (it is surely big enough to need a nightly run, considering how fast it is being saved). That gives you near-real-time writes and reads that are an hour or more stale, which raises the question: what is the point of such a configuration?
  • Few writes, many reads: this is the most common situation.
  • No writes, only reads: you can't really produce this situation; with no writes, what is there to read?

Sorry for my English; I hope this helps.

am05mhz
  • "Many writes, few reads" tables have become very common in recent years; people want statistics everywhere, they want to maintain logs of changes and even accesses, and so on. Indexes on these tables tends to weight more than the content itself and significantly penalize EVERY user events. Yet, very few users actual read these tables, and being moderately slow is often an acceptable tradeoff. – James Jan 03 '17 at 01:33
  • As for "no writes, only reads" tables, there are actual use cases for these; it is very common in relational databases, for example, to have read-only tables to support integrity checks on foreign keys or to provide constant values for use by stored procedures and functions. Some RDBMSs even provide highly optimized indexes for read-only tables... – James Jan 03 '17 at 02:12
  • @jwatkins Read-only tables still get writes, don't they? The difference is just who is writing to the table. The RDBMS still writes to the table and still pays the penalty of maintaining the index. – am05mhz Jan 03 '17 at 06:42
  • @jwatkins And as for the logging use case, I am well aware of it; I am just not seeing the benefit of dropping the index. Fast logging is indeed good, but when the time comes to dig into the data, do you have time to wait for the query? Searching for a few thousand rows among billions, maybe trillions considering how fast the data must be saved in real time? In theory you could do that with fast I/O, but in my opinion, if I had fast I/O, I would not worry about writing speed. – am05mhz Jan 03 '17 at 06:54