
I'm creating a PostgreSQL database in which we have many (10-40) variables that will have boolean values. I'd like to figure out the best way to store this data, given a moderate number of updates and lots of multi-column searches.

It seems pretty straightforward to just create the 30 or so boolean columns and create multi-column indexes where necessary. Alternatively, someone suggested creating a bit string that combines all of the booleans. The second option seems like it should be faster, but the answers people have given online seem contradictory (see below).
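For concreteness, here is a minimal sketch of the two options; table and column names are illustrative, not from an existing schema:

    -- Option 1: one boolean column per flag, with multi-column indexes
    -- on the combinations that are searched most often.
    CREATE TABLE items_bool (
        id     bigserial PRIMARY KEY,
        flag_a boolean NOT NULL DEFAULT false,
        flag_b boolean NOT NULL DEFAULT false
        -- ... up to 30-40 such columns
    );
    CREATE INDEX ON items_bool (flag_a, flag_b);

    SELECT count(*) FROM items_bool WHERE flag_a AND NOT flag_b;

    -- Option 2: all flags packed into a single bit string column.
    CREATE TABLE items_bits (
        id    bigserial PRIMARY KEY,
        flags bit(40) NOT NULL DEFAULT B'0'::bit(40)  -- zero-padded on the right
    );

    -- Flag 1 set and flag 2 clear (bit strings are 1-indexed from the left).
    SELECT count(*) FROM items_bits
    WHERE substring(flags from 1 for 1) = B'1'
      AND substring(flags from 2 for 1) = B'0';

Note that predicates like the substring() tests in option 2 cannot use a plain index on flags, so the bit string tends to imply full scans or per-flag expression indexes; that trade-off is part of what the linked answers disagree about.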

Any suggestions or explanations would be helpful. The data is tens of millions of rows, but not larger, and I expect selects to return somewhere between 1/100 and 1/4 of the data.

https://stackoverflow.com/questions/14067969/optimized-sql-using-bitwise-operator

alternative to bitmap index in postgresql

UPDATE:

I found one resource that suggests using int or bigint if you have more than a few variables (below that, separate columns are fine) but fewer than 33 or so (above that, switch to bit strings). This seems to be motivated more by storage size than by ease of searching.

https://dba.stackexchange.com/questions/25073/should-i-use-the-postgresql-bit-string
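For concreteness, a sketch of that int/bigint suggestion (all names are illustrative): up to 63 flags are packed into a bigint and tested with bitwise operators.

    CREATE TABLE items_int (
        id    bigserial PRIMARY KEY,
        flags bigint NOT NULL DEFAULT 0  -- bit 0 = first variable, bit 1 = second, ...
    );

    -- Set flag 3 on one row.
    UPDATE items_int SET flags = flags | (1::bigint << 3) WHERE id = 42;

    -- Rows with flag 0 set and flag 1 clear:
    -- mask out bits 0 and 1, then require exactly bit 0.
    SELECT * FROM items_int WHERE flags & 3 = 1;

As with bit strings, a plain B-tree index on flags does not help bitwise predicates; per-flag partial indexes (e.g. WHERE flags & 8 = 8) are one possible workaround.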

  • Do you understand that bitmap indexes have nothing to do with bit strings? – Mike Sherrill 'Cat Recall' Jan 04 '13 at 18:02
  • Yeah, it seems like there are two approaches: 1) lots of columns & bitmap index or 2) one bitstring column. Is that correct? – ten Jan 04 '13 at 18:03
  • No, the two approaches are lots of Boolean columns or one bit string column. AFAIK, only Oracle offers the option of a persistent bitmap index. Others--PostgreSQL, at least--build bitmap indexes on the fly in the optimizer. Different DBMSs offer different ways of indexing low-cardinality values. What are you using? – Mike Sherrill 'Cat Recall' Jan 05 '13 at 01:39

1 Answer


I've found a related discussion at the Database Administrators site.

First, I would define/analyze what is "best" in your context. Are you just looking for speed? What is your search pattern? Is data/disk volume an issue?

What alternatives do you have? Besides bit strings, one could use ordinary text strings, integer arrays, or separate columns. To get at the data fast, you have to think about indexing. You mentioned multi-column indexes. Would it make sense to store/index the same bit variable in several indexes?
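As an illustration of the integer-array alternative, here is a sketch under the assumption that storing the positions of the set flags is acceptable (names are made up):

    CREATE TABLE items_arr (
        id    bigserial PRIMARY KEY,
        flags int[] NOT NULL DEFAULT '{}'  -- e.g. '{2,7,31}' = flags 2, 7 and 31 set
    );
    CREATE INDEX ON items_arr USING gin (flags);

    -- Rows where flags 2 AND 7 are both set; the GIN index can serve this.
    SELECT * FROM items_arr WHERE flags @> ARRAY[2,7];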

40 bits without too many duplicate records means up to 2^40 ≈ 1.1E12 distinct records. This makes a full-table scan a lengthy affair. On the other hand, indexing is not really helpful if you have a lot of duplicate keys.

If you are expecting a result set of some 25%, you would have to transfer about 2.75E11 (partial) records between database and application. Assuming 10,000 records/s, this would take about 2.75E7 seconds, i.e. roughly 7,600 hours or 10.5 months.

My conclusion is that you should think about storing the data in big BLOBs (1.1E12 × 40 bits is about 5.5 TByte, so it would have to be split up). You could partition your data, read the interesting part into memory and do the search there. This is more or less what a Big Data or data warehouse system does.
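The partition-and-scan idea can also be sketched inside PostgreSQL itself, using declarative partitioning (available in modern versions); this swaps the BLOB for a partitioned table, and assumes one flag, here flag_a, is selective enough to partition on:

    CREATE TABLE items_part (
        id     bigint,
        flag_a boolean NOT NULL,
        flags  bit(40) NOT NULL
    ) PARTITION BY LIST (flag_a);

    CREATE TABLE items_part_true  PARTITION OF items_part FOR VALUES IN (true);
    CREATE TABLE items_part_false PARTITION OF items_part FOR VALUES IN (false);

    -- A filter on flag_a is answered from a single partition; the remaining
    -- flag tests then scan a much smaller slice of the data.
    SELECT count(*) FROM items_part
    WHERE flag_a AND substring(flags from 5 for 1) = B'1';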

– Axel Kemper