
How indexing works in a database (referring to the answer from Xenph Yan):

Creating an index on a field in a table creates another data structure which holds the field value and a pointer to the record it relates to. This index structure is then sorted, allowing binary searches to be performed on it.

The way I understood ORC indexing is that ORC keeps statistics (min, max, sum) about the rows every 10,000 rows (by default), and when I query the data it looks at these statistics to figure out whether it needs to read a row chunk or not.

So is it correct that ORC indexing does not sort the data?
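
To make the question concrete, here is a rough sketch of the kind of query I have in mind (table and column names are placeholders, not my real schema); as far as I understand, predicate pushdown has to be switched on for the reader to actually use those statistics:

    -- enable use of ORC row-group statistics for filtering (predicate pushdown)
    SET hive.optimize.index.filter=true;

    -- the reader should compare 42 against the min/max of each 10,000-row group
    -- and skip the groups whose range cannot contain it
    SELECT *
    FROM my_orc_table      -- placeholder table stored as ORC
    WHERE some_column = 42;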

I have a 69-column table with very unstructured data and I would like to be able to perform ad-hoc queries on every column. To do so, I would like to be able to sort every column through an index (or at least most of them). There is no 'key' column in the data that gets queried rapidly.

PhanThomas

2 Answers


As per the Cloudera support forum:

ORC indexes come in two forms: the standard indexes, which are always created (min/max values for each stride of each column), and Bloom filters.

Normal indexes are good for range queries and work amazingly well if the data is sorted. This is normally automatic for date columns or increasing columns like IDs.

Bloom filters are great for equality queries on things like URLs, names, etc. over data that is not sorted (i.e. a given customer name only appears occasionally in the data).

However, Bloom filters take some time to compute, take some space in the indexes, and do not work well for most columns in a data warehouse (numeric fields like profit, sales, ...). So they are not created by default and need to be enabled per column via:

orc.bloom.filter.columns

The stride size is the block of data that the ORC reader can skip during a read operation based on these indexes. 10,000 is normally a good number, and increasing it doesn't help you much.

http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data
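
For illustration, here is a minimal sketch of how these properties can be set when creating an ORC table, and how the data could be kept sorted on a frequently filtered column so the min/max indexes stay selective. Table and column names are made up; orc.bloom.filter.fpp is the optional false-positive rate:

    -- hypothetical table; Bloom filters enabled only for the columns that
    -- get equality filters, row-group stride left at the 10,000 default
    CREATE TABLE events_orc (
      event_id   BIGINT,
      customer   STRING,
      url        STRING,
      profit     DOUBLE
    )
    STORED AS ORC
    TBLPROPERTIES (
      "orc.bloom.filter.columns" = "customer,url",
      "orc.bloom.filter.fpp"     = "0.05",
      "orc.row.index.stride"     = "10000"
    );

    -- sorting on a frequently filtered column while loading keeps the
    -- min/max range of each stride narrow, so more strides can be skipped
    INSERT OVERWRITE TABLE events_orc
    SELECT event_id, customer, url, profit
    FROM events_staging          -- placeholder source table
    DISTRIBUTE BY customer
    SORT BY customer;

Note that sorting only helps the columns you sort by; with ad-hoc queries across all 69 columns you can only pick one primary sort column (or a few, via secondary sort keys), and rely on Bloom filters for the rest.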

Daniel Sobrado

Hive has been designed as a pseudo-SQL front-end for running (long) batch jobs on (massive) data sets. You can run "ad hoc queries" but forget about "rapidly".

Besides, when you index a column in a database (i.e. the CREATE INDEX command in SQL), what you index is the entire, exact value of each row. If your data is indeed "unstructured", that would probably make no sense.
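
For comparison, this is the kind of whole-value index meant here (generic SQL, hypothetical table and column names); the index stores the complete value of every row, which buys you nothing when you need to search inside the text:

    -- classic B-tree index over the exact column values (hypothetical schema)
    CREATE INDEX idx_customers_name ON customers (customer_name);

    -- helped by the index: exact-match lookup on the full value
    SELECT * FROM customers WHERE customer_name = 'Jane Doe';

    -- not helped by a plain value index: searching within the text
    SELECT * FROM customers WHERE customer_name LIKE '%doe%';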

So... if what you need is full-text search, why don't you dump all your data into something like ElasticSearch or Solr instead?

Samson Scharfrichter