HBase: How does data get written in a sorted manner into HFile?

Question

I have a fairly basic question about HFiles.

When a put/insert request is initiated, the value is first written to the WAL and then to the memstore. The values in the memstore are stored in the same sorted order as in the HFile. Once the memstore is full, it is flushed to a new HFile.
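The write path described above can be sketched in Python using a toy model. In this model a list stands in for the WAL, and the memstore's row keys are kept sorted with `bisect.insort`. A flush copies the already-sorted memstore into an immutable tuple that plays the role of an HFile. The class and method names (`ToyRegionStore`, `put`, `flush`) and the count-based flush threshold are hypothetical. Real HBase flushes based on memstore size in bytes (`hbase.hregion.memstore.flush.size`), not on a row count.

```python
import bisect

class ToyRegionStore:
    """Hypothetical toy model of one store: WAL, sorted memstore, flushed HFiles."""

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold  # toy rule: flush after this many distinct rows
        self.wal = []            # append-only log, in arrival order
        self.memstore_keys = []  # row keys, always kept in sorted order
        self.memstore = {}       # row key -> value
        self.hfiles = []         # immutable, sorted tuples of (key, value)

    def put(self, key, value):
        self.wal.append((key, value))                # 1. append to the WAL first
        if key not in self.memstore:
            bisect.insort(self.memstore_keys, key)   # 2. insert at its sorted position
        self.memstore[key] = value
        if len(self.memstore_keys) >= self.flush_threshold:
            self.flush()                             # 3. memstore "full": write a new file

    def flush(self):
        if not self.memstore_keys:
            return
        # The memstore is already sorted, so the new file is written in key order.
        hfile = tuple((k, self.memstore[k]) for k in self.memstore_keys)
        self.hfiles.append(hfile)   # never modified again
        self.memstore_keys = []
        self.memstore = {}

store = ToyRegionStore(flush_threshold=3)
for k in [30, 10, 20, 50, 40]:
    store.put(k, f"v{k}")
print(store.hfiles)         # [((10, 'v10'), (20, 'v20'), (30, 'v30'))]
print(store.memstore_keys)  # [40, 50]
```

The WAL keeps arrival order (30, 10, 20, ...). The flushed file is sorted only because the memstore was kept sorted as rows arrived.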

Now, I have read that the HFile stores the data in sorted order i.e. the sequential rowkeys will be next to each other.

Is this 100% true?

For example: I first write rows with rowkeys 1 to 1000, except rowkey 500. Assume that the memstore is now full, so it is flushed to a new HFile; call it HFile1. This file is now immutable.

Now, I will write rows 1001 to 2000, then I write rowkey 500. Assume that the memstore is full and it is flushed to an HFile; call it HFile2.

So, is this how it happens?

If yes, then rowkey 500 is not in HFile1, so the rowkeys in the HFiles are not in sorted order. So, is the statement above correct?

So, how does a read work in this case?

score 6 · Accepted Answer · edited Jun 20 '20 at 09:12

HFile stores the data in sorted order i.e. the sequential rowkeys will be next to each other.

Is this 100% true?

Yes, this is 100% accurate. Row keys within a single HFile are always sorted.

I will write rows 1001 to 2000, then I write rowkey 500. Assume that the memstore is full and it is flushed to an HFile; call it HFile2.

So, is this how it happens?

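Yes. Row 500 goes into HFile2, and it becomes the first row in that file because 500 sorts before 1001. This assumes the row keys are encoded so that byte order matches numeric order, e.g. fixed-width big-endian integers or zero-padded strings.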
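Whether 500 really ends up first in HFile2 depends on how the row keys are encoded. HBase compares row keys as unsigned byte arrays, lexicographically. The Python sketch below uses `bytes` comparison, which has the same ordering, to contrast two hypothetical encodings of the same numbers: ASCII decimal strings and 4-byte big-endian integers.

```python
import struct

def as_string_key(n):
    # Row key as ASCII decimal text, e.g. b"500"
    return str(n).encode("ascii")

def as_int_key(n):
    # Row key as a 4-byte big-endian unsigned integer (non-negative values only)
    return struct.pack(">I", n)

# Contents of the second memstore in the example: rows 1001..2000, then row 500.
rows = list(range(1001, 2001)) + [500]

# Python compares bytes lexicographically and unsigned, like HBase row-key ordering.
hfile2_string_keys = sorted(rows, key=as_string_key)
hfile2_int_keys = sorted(rows, key=as_int_key)

print(hfile2_int_keys[0])      # 500: numeric order is preserved
print(hfile2_string_keys[0])   # 1001: b"1001" < b"500" because b"1" < b"5"
print(hfile2_string_keys[-1])  # 500
```

With decimal-string keys the file is still sorted, just not numerically. This is why numeric row keys are commonly zero-padded or stored as fixed-width binary.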

If yes, then rowkey 500 is not in HFile1, so the rowkeys in the HFiles are not in sorted order. So, is the statement above correct?

Yes, row keys within a single HFile are always sorted; the guarantee is per file, not across files. HBase periodically performs compactions, which merge multiple HFiles and rewrite them into a single HFile. The HFile produced by a compaction is also sorted.
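Every input HFile is already sorted, so a compaction can produce sorted output with a streaming k-way merge instead of a full re-sort. The Python sketch below is a simplified model, not HBase's compactor. Each hypothetical file is a list of `(row_key, sequence_id, value)` tuples, sorted by row key and newest-first within a row. A higher `sequence_id` means a newer write. Only the newest version of each row is kept; real HBase keeps up to the column family's configured number of versions and also handles delete markers and TTLs.

```python
import heapq
from itertools import groupby

def compact(hfiles):
    """Merge several sorted files into one sorted file, keeping the newest version per row.

    Each file is a list of (row_key, sequence_id, value) tuples sorted by
    (row_key, -sequence_id), i.e. by row key and newest-first within a row.
    """
    # k-way merge: order by row key, then newest-first (negated sequence_id).
    merged = heapq.merge(*hfiles, key=lambda cell: (cell[0], -cell[1]))
    result = []
    for _row_key, cells in groupby(merged, key=lambda cell: cell[0]):
        result.append(next(cells))  # first cell of each row group is the newest version
    return result

hfile1 = [(k, 1, f"old{k}") for k in range(1, 6) if k != 3]  # rows 1, 2, 4, 5
hfile2 = [(3, 2, "new3"), (4, 2, "new4"), (7, 2, "new7")]
compacted = compact([hfile1, hfile2])
print([row for row, _, _ in compacted])  # [1, 2, 3, 4, 5, 7]
print(compacted[3])                      # (4, 2, 'new4')
```

Row 3, which like row 500 in the question only exists in the newer file, lands in its sorted position in the compacted output.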

So, how does a read work in this case?

At read time, if a store has more than one HFile, HBase reads the row from every HFile (checking whether the row is present and, if so, reading it) and also from the memstore. It then combines the results so that it returns the latest data.
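A toy Python sketch of that point read follows. It models the memstore and each HFile as a dict mapping row key to a `(sequence_id, value)` pair, where a higher sequence_id means a newer write. `get_row` is a hypothetical helper, not an HBase API. It checks every source and returns the newest value, which is why a row that landed in a later file (like 500 in HFile2) is still found.

```python
def get_row(row_key, memstore, hfiles):
    """Return the newest value for row_key across the memstore and all HFiles, or None."""
    newest = None  # (sequence_id, value) of the newest version seen so far
    for source in [memstore, *hfiles]:
        cell = source.get(row_key)  # a real HFile lookup would go through its block index
        if cell is not None and (newest is None or cell[0] > newest[0]):
            newest = cell
    return None if newest is None else newest[1]

hfile1 = {k: (1, f"v{k}") for k in range(1, 1001) if k != 500}
hfile2 = {k: (2, f"v{k}") for k in list(range(1001, 2001)) + [500]}
memstore = {10: (3, "updated10")}

print(get_row(500, memstore, [hfile1, hfile2]))   # v500 (found only in HFile2)
print(get_row(10, memstore, [hfile1, hfile2]))    # updated10 (memstore is newer)
print(get_row(5000, memstore, [hfile1, hfile2]))  # None
```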

HBase: The Definitive Guide has a very good explanation of how the HBase read path works.

Thanks. Compaction of HFiles happens periodically, which means there can be a window in which rowkey 500 is not in HFile1 (which has rowkeys 1-1000) but is in HFile2 (which has rowkey 500 and rowkeys 1001-2000). As far as I know, when a key is read, the FileTrailer of the HFile is read first, and then the FileInfo of the HFile. The FileInfo holds the 'Last Key' of the HFile, and using this information the HFile reader can decide whether it has to read the HFile to find rowkey 500. If my understanding is correct, then HFile1 will be read and HFile2 will not, correct? – user3031097, Nov 04 '14 at 06:23
Each HFile is divided into blocks (64KB by default). Each block contains the actual KVs (data), and since HFile format v2 there are also block-level Bloom filters and indexes. If Bloom filters are not enabled for the column family, HBase will try to read all the HFiles and check whether the row key exists by looking it up through their block indexes. – Ashrith, Nov 04 '14 at 18:14
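The file pruning discussed in these comments can be sketched in Python. This is a simplified model, not HBase's HFile reader. Each hypothetical `ToyHFile` records its first and last key. The comment mentions the last key stored in FileInfo; having the first key available as well is an assumption of the sketch. Each file also gets a single per-file Bloom filter built with `hashlib`, rather than HFile v2's block-level filters. The filter size and number of hash functions are arbitrary. In the question's example, rowkey 500 lies inside the key range of both HFile1 (1..1000) and HFile2 (500..2000), so a range check alone rules out neither file. A Bloom filter can skip HFile1 unless it returns a false positive.

```python
import hashlib

class ToyHFile:
    """Hypothetical model of an HFile's read-side metadata: key range plus a row Bloom filter."""

    NUM_BITS = 16384  # Bloom filter size in bits (arbitrary for this sketch)
    NUM_HASHES = 4    # bit positions set per key (arbitrary for this sketch)

    def __init__(self, sorted_keys):
        self.first_key = sorted_keys[0]
        self.last_key = sorted_keys[-1]
        self.bits = bytearray(self.NUM_BITS // 8)
        for key in sorted_keys:
            for pos in self._positions(key):
                self.bits[pos // 8] |= 1 << (pos % 8)

    def _positions(self, key):
        # Derive NUM_HASHES bit positions from one SHA-256 digest of the key.
        digest = hashlib.sha256(str(key).encode()).digest()
        for i in range(self.NUM_HASHES):
            chunk = int.from_bytes(digest[4 * i:4 * i + 4], "big")
            yield chunk % self.NUM_BITS

    def in_key_range(self, key):
        return self.first_key <= key <= self.last_key

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

def files_to_read(key, hfiles, use_bloom):
    """Return the indexes of the files a point read of key would still have to open."""
    candidates = []
    for i, hfile in enumerate(hfiles):
        if not hfile.in_key_range(key):
            continue  # outside [first_key, last_key]: skip without reading any blocks
        if use_bloom and not hfile.might_contain(key):
            continue  # Bloom filter says this row was never written to this file
        candidates.append(i)
    return candidates

hfile1 = ToyHFile([k for k in range(1, 1001) if k != 500])
hfile2 = ToyHFile([500] + list(range(1001, 2001)))

print(files_to_read(500, [hfile1, hfile2], use_bloom=False))   # [0, 1]: 500 is in both ranges
print(files_to_read(1500, [hfile1, hfile2], use_bloom=False))  # [1]: 1500 > last key of HFile1
print(files_to_read(500, [hfile1, hfile2], use_bloom=True))    # always contains 1; 0 only on a false positive
```

Bloom filters never give false negatives. A file that really holds the row is never skipped, which is what makes them safe to use for pruning reads.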
