
What are the advantages of Cassandra over HBase when it comes to MapReduce jobs?

I have a lot of small files that I would like to move from HDFS to a database, and those files would be the input for MapReduce jobs. I would not read all the files, only those for a certain user, so possibly a whole row, or at least a column family. I might also read files from a certain period.

I know that HBase is the Hadoop database, so I expect that it integrates well with what I need, but I have also read that Cassandra has much better performance. I would like to know what the situation is when the database is used as input for MapReduce jobs. Is Cassandra's performance still a lot better than HBase's in that case?

I must emphasize that I'm not looking for a comparison of HBase and Cassandra in general, but for the concrete case of MapReduce jobs. Questions like this do not specifically discuss performance for MapReduce jobs. Also, I'm looking for recent information (the question I mentioned is from 2011, and I believe there may have been some changes since then).

Kobe-Wan Kenobi
  • Thank you for your suggestion, but that question and its answers may be a bit outdated (how much has changed since 2011?). Also, I'm not only interested in comparing the databases in general; I would like to know which performs better for MapReduce jobs, and there is no information about that. – Kobe-Wan Kenobi Nov 05 '15 at 12:07

1 Answer

Both databases have great read and write performance. HBase possibly has slightly better performance than Cassandra for bulk reads. But there are two use cases where HBase will work significantly faster than Cassandra because of its design.

First, when the MapReduce job needs only a portion of the data, selected by column name, e.g. HTML pages and some information parsed from them. You put the HTML in one column family and the parsed information in another. Different column families are stored in different files in HDFS, so reading one does not require reading the other. This gives a significant performance benefit when you need only the parsed data, which occupies several times less disk space than the HTML. With Cassandra you would need to read the whole table.
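To make the column-family argument concrete, here is a rough sketch in plain Python. It is only a toy model, not HBase code: the `store` layout, the `scan` helper, and the sample rows are all made up for illustration, with one dict standing in for each column family's files. In a real HBase job the same restriction is usually expressed on the `Scan` passed to the job, for example with `Scan.addFamily`.

```python
# Toy model (assumption): one dict per column family, standing in for the
# separate files that each family gets on disk.
store = {
    "html": {
        "example.com/a": "<html>" + "x" * 5000 + "</html>",
        "example.com/b": "<html>" + "y" * 8000 + "</html>",
    },
    "parsed": {
        "example.com/a": "title=A;links=3",
        "example.com/b": "title=B;links=7",
    },
}


def scan(store, families):
    """Read only the requested families; return (rows, bytes_read)."""
    rows = {}
    bytes_read = 0
    for family in families:
        for row_key, value in store[family].items():
            rows.setdefault(row_key, {})[family] = value
            bytes_read += len(value.encode("utf-8"))
    return rows, bytes_read


# A job that needs only the parsed data never touches the large HTML values.
parsed_rows, parsed_bytes = scan(store, ["parsed"])
all_rows, all_bytes = scan(store, ["html", "parsed"])

print(f"parsed only: {parsed_bytes} bytes, all families: {all_bytes} bytes")
```

The gap between the two byte counts is the saving described above; how large it is in practice depends on the real data.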

Second, when you need to access data ordered by row key, or some part of the table based on that order, e.g. reading the HTML pages from one domain. In HBase you can use a row key that is the concatenation of domain and URL. HBase has a good balancer for the case of unhashed row keys. Cassandra does not, so you either have to use some trick for balancing in this case or you have to scan the whole table.
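The row-key idea can be sketched the same way. This is a plain Python model of a store that keeps its keys sorted, not the HBase API; `make_row_key`, `prefix_scan`, and the sample domains are hypothetical names chosen for the example.

```python
from bisect import bisect_left


def make_row_key(domain, url_path):
    """Row key = domain followed by URL path, so one domain's pages sort together."""
    return f"{domain}{url_path}"


def prefix_scan(sorted_keys, prefix):
    """Return all keys starting with prefix, visiting only the matching range."""
    start = bisect_left(sorted_keys, prefix)  # jump straight to the first candidate
    result = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so nothing later can match
        result.append(key)
    return result


keys = sorted([
    make_row_key("example.com", "/a"),
    make_row_key("example.org", "/x"),
    make_row_key("another.net", "/"),
    make_row_key("example.com", "/b"),
])

pages = prefix_scan(keys, "example.com/")
print(pages)
```

Because the keys are kept in order, the scan jumps to the first matching key and stops at the first non-matching one. With hashed keys the pages of one domain would be scattered across the table, which is why the unhashed case matters here.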

I hope these use cases give you some picture of when it is better to use HBase and when Cassandra.

Alexander Kuznetsov