10

I've got an index on columns a VARCHAR(255), b INT in an InnoDB table. Given two a,b pairs, can I use the MySQL index to determine if the pairs are the same from a c program (i.e. without using a strcmp and numerical comparison)?

  1. Where is a MySQL InnoDB index stored in the file system?
  2. Can it be read and used from a separate program? What is the format?
  3. How can I use an index to determine if two keys are the same?

Note: An answer to this question should either a) provide a method for accessing a MySQL index in order to accomplish this task or b) explain why the MySQL index cannot practically be accessed/used in this way. A platform-specific answer is fine, and I'm on Red Hat 5.8.


Below is the previous version of this question, which provides more context but seems to distract from the actual question. I understand that there are other ways to accomplish this example within MySQL, and I provide two. This is not a question about optimization, but rather of factoring out a piece of complexity that exists across many different dynamically generated queries.

I could accomplish my query using a subselect with a subgrouping, e.g.

SELECT c, AVG(max_val)
FROM (
    SELECT c, MAX(val) AS max_val
    FROM table
    GROUP BY a, b) AS t
GROUP BY c

But I've written a UDF that allows me to do it with a single select, e.g.

SELECT b, MY_UDF(a, b, val)
FROM table
GROUP by c

The key here is that I pass the fields a and b to the UDF, and I manually manage a,b subgroups in each group. Column a is a varchar, so this involves a call to strncmp to check for matches, but it's reasonably fast.

However, I have an index my_key (a ASC, b ASC). Instead of checking for matches on a and b manually, can I just access and use the MySQL index? That is, can I get the index value in my_key for a given row or a,b pair in c (inside the UDF)? And if so, would the index value be guaranteed to be unique for any value a,b?

I would like to call MY_UDF(a, b, val) and then look up the mysql index value (a,b) in c from the UDF.

jmilloy
  • 7,875
  • 11
  • 53
  • 86
  • doing this is often a bad idea performance wise... – Mitch Wheat Nov 10 '12 at 00:01
  • Depends on how deep you want to get into this: other answers here show you'd have to [write an extension in C/C++](http://stackoverflow.com/questions/8947232/how-to-create-a-mysql-stored-aggregate-function). If you passed `b` as well, though, you could basically do at least _some_ of the work in a simpler, SQL-only function - however, your statement would probably look like the subselect anyways, so I'm not sure you'd benefit. – Clockwork-Muse Nov 10 '12 at 00:05
  • 2
    @Clockwork-Muse I know how to write an extension in C, and it's called a udf. I'm asking specifically about accessing a mysql index from the udf. You're right, putting the subselect somewhere else is not the solution here. – jmilloy Nov 10 '12 at 00:07
  • 1
    Granting that I've never written one: No, I don't think so. Or at least, not without writing something _deployment specific_ (and really hacky) - you'd probably have to hard-code the name of the table. Given the way the calling-order for [functions seems to work](http://dev.mysql.com/doc/refman/5.6/en/adding-udf.html), you don't actually even see the actual table-object (or indices) itself, you get rows pre-sorted and pre-grouped from the optimizer. Personally, I wouldn't want to anyways, as it allows better future flexibility (in case a different index should be chosen). – Clockwork-Muse Nov 10 '12 at 00:24
  • 2
    What are you trying to achieve by selecting ungrouped and unaggregated `c` in your innermost query? Could you please provide some sample data and desired output? – Quassnoi Nov 14 '12 at 22:24
  • 1
    @jmilloy: You have a bounty because *"This question has not received enough attention."*, yet you don't answer the users' comments and questions. Why do you think we should give attention then, if you are not giving much? – ypercubeᵀᴹ Nov 21 '12 at 20:25
  • 1
    @ypercube What question do you have? The answer to Quassnoi's questions are quite obvious and, regardless, knowing the answers won't help you answer my question, so I ignored it. I select `c` because I need it in the outer query. In one query, it's a hierarchy, so groups of `c` have one or more subgroups of `a,b`, but groups of `a,b` always share the same value of `c`. In another, I don't care which value of `c` I get, but I need one of them. Anyone could have guessed these as they are the only two options; and what information have you gathered relating to accessing index values? – jmilloy Nov 22 '12 at 03:43
  • 1
    As it turns out, I gather that the answer to my question is "no." I'm not interested in other ways to solve this problem (as I have several and they are obvious), but very specifically in whether it is possible to do it this way. I wanted enough attention to convince that if it were possible, someone who knew would see it. No need to answer misdirected questions once the answer has become clear. – jmilloy Nov 22 '12 at 03:46
  • 1
    The real problem you are trying to solve, is obviously not obvious, since a 128K user (who answers mainly in the SQL territory) asked you some questions. You chose to ignore them. – ypercubeᵀᴹ Nov 22 '12 at 07:16
  • 2
    It sounds like you are looking for a solution that borders on a hack. If my understanding is correct, take a look at [this post](http://yoshinorimatsunobu.blogspot.com/2009/08/accessing-mysql-tables-from-udf-storage.html): the guy makes no bones about it being a hack, although including `mysql_priv.h` gives away the hacking nature of the implementation anyway. It looks like his code can access mysql tables in memory from within the body of a UDF; the same path *may* lead you to a solution that lets you access indexes as well. – Sergey Kalinichenko Nov 26 '12 at 20:59
  • 1
    @dasblinkenlight Thanks. Is it a hack when MySQL accesses its own indices? I just want to do the same thing. Anyways, the mysql_priv header sounds like the right place to start to learn how it does access them. I've been hoping someone here Just Knows How. – jmilloy Nov 26 '12 at 21:13
  • 4
    @jmilloy The difference between MySQL accessing a table and a UDF accessing a table using MySQL's *private* header is that a UDF does not have the same release cycle as MySQL: if designers of MySQL decide to change their headers and the internals tomorrow, they would also change the rest of the code to match the change in the header, and be fine; in contrast, UDF would just break. Even worse, they could make a breaking change to their semantic without letting anybody know. To them, it's a fair game, because the header is private. To everyone else relying on private headers, it's a lot of risk. – Sergey Kalinichenko Nov 26 '12 at 21:52

4 Answers4

6

Look back at your original query

SELECT c, AVG(max_val)
FROM
(
    SELECT c, MAX(val) AS max_val
    FROM table
    GROUP BY a, b
) AS t
GROUP BY c;

You should first make sure the subselect gives you what you want by running

SELECT c, MAX(val) AS max_val
FROM table
GROUP BY a, b;

If the result of the subselect is correct, then run your full query. If that result is correct, then you should do the following:

ALTER TABLE `table` ADD INDEX abc_ndx (a,b,c,val);

This will speed up the query by getting all needed data from the index only. The source table never needs to be consulted.

Writing a UDF is and calling it a single SELECT is just masquerading a subselect and creating more overhead than the query needs. Simply placing your full query (one nested pass over the data) in the Stored Procedure will be more effective that getting most of the data in the UDF and executing single row selects iteratively ( something like O(n log n) running time with possible longer Sending data states).

UPDATE 2012-11-27 13:46 EDT

You can access the index without touching the table by doing two things

  • Create a decent Covering Index

    ALTER TABLE table ADD INDEX abc_ndx (a,b,c,val);

  • Run the SELECT query I mentioned before

Since the all the columns of the query all in the index, the Query Optimizer will only touch the index (or precache index pages). If the table is MyISAM, you can ...

  1. setup the MyISAM table to have a dedicated key cache that can be preloaded on mysqld startup
  2. run SELECT a,b,c,val FROM table; to load index pages into MyISAM's default keycache

Trust me, you really do not want to access index pages against mysqld's will. What do I mean by that?

For MyISAM, the index pages for a MyISAM table are stored in the .MYI file of the table. Each DML statement will summon a full table lock.

For InnoDB, the index pages are loaded into the InnoDB Buffer Pool. Consequently, the associated data pages will load into the InnoDB Buffer Pool as well.

You should not have to circumvent access to index pages using Python, Perl, PHP, C++, or Java because of the constant I/O needed by MyISAM or the constant MVCC protocols being exercised by InnoDB.

There is a NoSQL paradigm (called HandlerSocket) that would permit low-level access to MySQL tables that can cleanly bypass mysqld's normal access patterns. I would not recommend it since there was a bug in it when using it to issue writes.

UPDATE 2012-11-30 12:11 EDT

From your last comment

I'm using InnoDB, and I can see how the MVCC model complicates things. However, apparently InnoDB stores only one version (the most recent) in the index. The access pattern for the relevant tables is write-once, read-many, so if the index could be accessed, it could provide a single, reliable datum for each key.

When it comes to InnoDB, MVCC is not complicating anything. It can actually become your best friend provided:

  • if you have autocommit enabled (It should be enabled by default)
  • the access pattern for the relevant tables is write-once, read-many

I would expect the accessed index pages to be sitting in the InnoDB Buffer Pool virtually forever if it is read repeatedly. I would just make sure your innodb_buffer_pool_size is set high enough to hold necessary InnoDB data.

Community
  • 1
  • 1
RolandoMySQLDBA
  • 43,883
  • 16
  • 91
  • 132
  • 1
    The reason I want to use the UDF is not for speed. Neither query has speed problems. The queries are being generated in python with many variations depending on the use case, and writing a single UDF greatly reduced the complexity of the python code. – jmilloy Nov 20 '12 at 13:37
  • 1
    I'm using InnoDB, and I can see how the MVCC model complicates things. However, apparently InnoDB stores only one version (the most recent) in the index. The access pattern for the relevant tables is write-once, read-many, so if the index could be accessed, it could provide a single, reliable datum for each key. Does it? (http://stackoverflow.com/questions/4841692/how-indices-cope-with-mvcc). – jmilloy Nov 30 '12 at 16:59
  • I'm reading http://dev.mysql.com/doc/refman/5.0/en/innodb-multi-versioning.html and now http://dev.mysql.com/doc/refman/5.0/en/innodb-table-and-index.html – jmilloy Nov 30 '12 at 17:04
4

You probably cannot access the key directly. I don't think this would actually make any difference performance-wise.

If you set covering indizes in the right order MySQL will not fetch a single page from the hard disk but deliver the result directly out of the index. There's nothing faster than this.

Note that your subselect may end up in a temptable on disk if its result is getting larger than your tmp_table_size or max_heap_table_size.

Check the status of Created_tmp_tables_disk_tables if you're not sure.

More on how MySQL is using internal temporary tables you find here http://dev.mysql.com/doc/refman/5.5/en/internal-temporary-tables.html

If you want, post your table structure for a review.

Michel Feldheim
  • 17,625
  • 5
  • 60
  • 77
  • 1
    Hmmm. I'd like to use the key to determine quickly if two rows are distinct according to that index. How will a covering indices or temp tables affect my ability to determine if two rows are distinct inside the UDF? – jmilloy Nov 15 '12 at 20:42
  • I am talking about optimizing queries using MySQLs' own features rather than replacing it's given functionality in UDFs – Michel Feldheim Nov 15 '12 at 22:43
  • 1
    Okay. The subselect version is optimized just fine for my needs, and I'm replacing it with a single select + udf for other reasons. Thanks though. – jmilloy Nov 15 '12 at 23:04
4

If you just want to access an index outside of MySQL, you will have to use the API for one of the MySQL storage engines. The default engine is InnoDB. See overview here: InnoDB Internals. This describes (at a very high level) both the data layout on disk and the APIs to access it. A more detailed description is here: Embedded InnoDB.

However, rather than write your own program that uses InnoDB APIs directly (which is a lot of work), you might use one of the projects that have already done that work:

  • HandlerSocket: gives NoSQL access to InnoDB tables, runs in a UDF. See a very informative blog post from the developer. The goal of HandlerSocket is to provide a NoSQL interface exposed as a network daemon, but you could use the same technique (and much of the same code) to provide something that would be used by a query withing MySQL.

  • memcached InnoDB plugin. gives memcached style access to InnoDB tables.

  • HailDB: gives NoSQL access to InnoDB tables, runs on top of Embedded InnoDB. see conference presentation. EDIT: HailDB probably won't work running side-by-side with MySQL.

I believe any of these can run side-by-side with MySQL (using the same tables live), and can be used from C, so they do meet your requirements.

If you can use/migrate to MySQL Cluster, see also NDB API, a direct API, and ndbmemcache, a way to access MySQL Cluster using memcache API.

This is hard to answer without knowing why you are trying to do this, because the implications of different approaches are very different.

Alex I
  • 19,689
  • 9
  • 86
  • 158
  • 1
    Thanks. I tried to give a lot of detail about exactly why I became interested in this. I'm dynamically generating queries which can group the data in any of several levels in a hierarchy, but may require subgroupings to aggregate data correctly and expose the correct data for a particular join. I vastly simplified the sql generation with a UDF which handles the relevant distinct/subgroup/aggregate functionality without losing much speed. But I wondered if, inside the UDF, I could leverage a specific index to determine if two rows are distinct according to the index. – jmilloy Nov 30 '12 at 17:19
  • 1
    I'm using InnoDB, and the relevant rows are written once, and then subsequently read, without updates. – jmilloy Nov 30 '12 at 17:22
  • 1
    @jmilloy: Sounds like you want to directly access tables from within a UDF same as HandlerSocket does. HandlerSocket is probably the best starting point for you, you can get it and hack on it until it does exactly what you want. You can probably rip out the network (daemon) part of it completely. – Alex I Nov 30 '12 at 20:52
0

No. There is no practical way to make use of a MySQL index, from within a C program, accessing a MySQL index in a means other than the MySQL engine, to check whether two (a,b) pairs (keys) are the same or not.

There are more practical solutions which don't require accessing MySQL datafiles outside of the MySQL engine or writing a user-defined function.


Q: Do you know where the mysql index is stored in the file system?

The location the index within the file system is going to depend on the storage engine for the table. For MyISAM engine, the indexes are stored in .MYI files under the datadir/database directory; InnoDB indexes are stored within an InnoDB managed tablespace file. f innodb_file_per_table variable was set when the table was created, there will be a separate .ibd file for each table under the innodb_data_home_dir/database subdirectory.

Q: Do you know what the format is?

The storage format of each storage engine is different, MyISAM, InnoDB, et al., and also depends on the version. I have some familiarity with how the data is stored, in terms of what MySQL requires of the storage engine. Detailed information about the internals would be specific to each engine.

Q: What makes it impractical?

It's impractical because it's a whole lot of work, and it's going to be dependent on details of storage engines that are likely to change in the future. It would be much more practical to define the problem space, and to write a SQL statement that would return what you want.

As Quassnoi pointed out in his comment to your question, it's not at all clear what particular problem you are trying to solve by creating a UDF or accessing MySQL indexes from outside of MySQL. I'm certain that Quassnoi would have a good way to accomplish what you need with an efficient SQL statement.

spencer7593
  • 106,611
  • 15
  • 112
  • 140
  • 1
    I don't disagree with you, but... Do you know where the mysql index is stored in the file system? Do you know what the format is? What makes it impractical? – jmilloy Nov 30 '12 at 16:46
  • 1
    I'm using InnoDB. I don't believe innodb_file_per_table was used. – jmilloy Nov 30 '12 at 21:09
  • @spencer7593, It's quite clear what he wants to do here. He wants to access the InnoDB **storage engine**. He doesn't want to access the MySQL database engine. – Pacerier Jan 19 '15 at 14:17
  • @Pacerier: Thanks for the comment. If you look at the edit history for the question, you will note that the reference to "InnoDB" was added 3 1/2 hours *after* my edit to this answer. When I posted my answer, there was no reference to InnoDB in the question, or any storage engine, there was a reference to MySQL. OP left a comment to my short answer, and asked three questions. I edited my answer to add a followup, to answer the three specific questions OP asked. The reference to **InnoDB** was added to the question 3 1/2 hours *after* my last edit to the question. – spencer7593 Jan 19 '15 at 22:56
  • @Pacerier: What I said was "not at all clear" was **what problem** OP was trying to solve by accessing the index and/or a UDF. Yes, he asked about accessing the index outside of MySQL. That was clear. But what I wasn't satisfied with was the description of **why** this needed to be done in the first place. What actual **problem** would be solved by doing this. – spencer7593 Jan 19 '15 at 23:12