
I have a few large databases, greater than 100 million records. They consist of the following:

  1. A unique key.
  2. An integer value, not unique, but used for sorting the query.
  3. A VARCHAR(200).

I have them in a MySQL MyISAM table now. My thought was, hey, I'll just set up a covering index on the data, and it should pull out reasonably fast. Queries are of the form...

select valstr,account 
    from datatable 
    where account in (12349809, 987987223,...[etc]) 
    order by orderPriority;

This seemed OK in some tests, but on our newer installation, it's terribly slow. It seems faster to have no index at all, which seems odd.

In any case, I'm thinking, maybe a different database? We use a data-warehousing DB for other parts of the system, but it's not well suited for anything involving text. Any free, or fairly cheap, DBs are an option, as long as they have reasonably useful API access. SQL optional.

Thanks in advance.

-Kevin

Kevin Galligan
    A new database seems like jumping to conclusions. Have you checked the I/O on the database? Perhaps look at index fragmentation? Partitioning strategies? There are lots of things to check first before you ditch what you're currently using. – Daniel DiPaolo Oct 12 '10 at 18:15
  • Haven't yet, at least not too much. If you think that's a good strategy, I'll do that. Just wanted to get some feedback. Since this is read-only data, I was thinking, maybe there's a db that does this great? I'll edit my question (if possible, new to the questions). – Kevin Galligan Oct 12 '10 at 18:18
  • I don't know how to edit. OK, so more detail. This is read-only data. I can index it in any way I want. I'm more than willing to try MySQL tweaks. I could also probably sort in-app, if that would help significantly. The IN lists will usually be between 1,000 and 50,000 ids. Since this may jump all over the index, I don't know what would be the best plan to get performance to a reasonable state (reasonable being relative, obviously, but let's say significantly better than a table scan. Actual numbers to follow soon). – Kevin Galligan Oct 12 '10 at 18:21
  • Do you always Order by Orderpriority? If so, it would make sense to add a clustered index on that field. Make another index to cover your selected fields (valstr, account) and see how that impacts performance. – JNK Oct 12 '10 at 18:50
  • How many account numbers are there in the WHERE ACCOUNT IN () clause? If you only have one account number in the WHERE clause how's the performance compared to having 10 account numbers? – Bob Jarvis - Слава Україні Oct 12 '10 at 19:27
  • Between 1,000 and 50,000 in the 'IN' clause. Creating a temp table and doing a join is also an option (see the sketch after these comments), but I'm not sure that would help. – Kevin Galligan Oct 12 '10 at 19:43
  • 50,000 ids in the IN clause will not be easy for any kind of DB to handle... – Thilo Oct 14 '10 at 01:35
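
A minimal sketch of the temp-table join mentioned above, assuming the table and column names from the question; the temp table name and how it gets loaded are made up:

    -- Load the 1,000-50,000 account ids into a temp table (e.g. via LOAD DATA
    -- or batched INSERTs), then join against it instead of passing a huge IN list.
    CREATE TEMPORARY TABLE wanted_accounts (
        account BIGINT NOT NULL PRIMARY KEY
    );

    SELECT d.valstr, d.account
        FROM datatable d
        JOIN wanted_accounts w ON w.account = d.account
        ORDER BY d.orderPriority;

Whether this beats a large IN list depends on the optimizer and the available indexes, so it would need to be benchmarked against the original query.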

3 Answers


CouchDB, MongoDB, and Riak are all going to be good at finding the key (account) relatively quickly.

The problems you're going to have (with any solution) are tied to the "order by" and "account in" clauses.

Problem #1: account in

120M records likely means gigabytes of data. You probably have an index over a gig. The reason this is a problem is that your "in" clause can easily span the whole index. If you search for accounts "0000001" and "9999581" you probably need to load a lot of index.

So just to find the records, your DB first has to load potentially a gig of index into memory. Then to actually load the data you have to go back to the disk again. If the accounts in the IN clause are not close together, you're going back multiple times to fetch various blocks. At some point it may be quicker to just do a table scan than to load the index and the table.
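
One way to check whether the server is actually using the index or falling back to a table scan for a given IN list is EXPLAIN; this is standard MySQL, shown here against the table and query from the question:

    EXPLAIN
    SELECT valstr, account
        FROM datatable
        WHERE account IN (12349809, 987987223)
        ORDER BY orderPriority;
    -- 'type: range' with the account index in 'key' means the index is being used;
    -- 'type: ALL' means a full table scan, and 'Using filesort' under 'Extra' means
    -- the ORDER BY is resolved by sorting the result set rather than by an index.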

Then you get to problem #2...

Problem #2: order by

If you have a lot of data coming back from the IN clause, then ORDER BY is just another layer of slowness. With an ORDER BY, the server can't stream you the data. Instead it has to load all of the matching records into memory, sort them, and then stream them.

Solutions:

  1. Have lots of RAM. If the entire index can't fit in RAM, then the index loads will be slow.
  2. Try limiting the number of "in" items. Even 20 or 30 items in this clause can make the query really slow.
  3. Try a Key-Value database?

I'm a big fan of K/V databases, but you have to look at point #1. If you don't have a lot of RAM and you have lots of data, then the system is going to run slowly no matter what DB you use. That RAM / DB size ratio is really important if you want good performance in these scenarios (small look-ups in big datasets).
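
A quick way to get that ratio on the existing MySQL setup is to compare the table's index size to the relevant cache size; both statements below are standard MySQL, using the table name from the question:

    -- Data_length and Index_length in the output are on-disk sizes in bytes.
    SHOW TABLE STATUS LIKE 'datatable';

    -- For MyISAM the index cache is the key buffer; for InnoDB it's the buffer pool.
    SHOW VARIABLES LIKE 'key_buffer_size';
    SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

If Index_length is much larger than the cache, the random index reads described above will keep going to disk.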

Gates VP

Here's a reasonably sized example of a MySQL database using the InnoDB engine, which takes advantage of clustered indexes, on a table with approx. 125 million rows and a query runtime of 0.021 seconds, which seems fairly reasonable.

Rewriting mysql select to reduce time and writing tmp to disk

http://pastie.org/1105206
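
Since the pastie may not stay available, here is a rough sketch of the kind of InnoDB layout being described, using the table and column names from the question; the exact schema in the linked example may differ:

    -- In InnoDB the PRIMARY KEY is the clustered index, so rows are stored in
    -- primary-key order and a lookup on it does not need a second seek back
    -- into the table.
    CREATE TABLE datatable (
        account BIGINT UNSIGNED NOT NULL,
        orderPriority INT NOT NULL,
        valstr VARCHAR(200) NOT NULL,
        PRIMARY KEY (account)
    ) ENGINE=InnoDB;

    SELECT valstr, account
        FROM datatable
        WHERE account IN (12349809, 987987223)
        ORDER BY orderPriority;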

Other useful links:

http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html

http://dev.mysql.com/doc/refman/5.0/en/innodb-adaptive-hash.html

Hope it proves of interest.

Jon Black

CouchDB will give you storage by key, and you can create views to do the query/sorting. A second option could be Cassandra, but there's a pretty big learning curve.

Matt Williamson