0

I need to get a stratified sample of my huge table. Specifically, I want to select 1/n rows from my table without bias, i.e. select randomly, select every nth row, etc.

Before I asked this question, I tried doing this. However, it didn't work for me because I am using the InfiniDB engine and, as I found out later, it doesn't support variables in sub-expressions, or something like that. Does anyone know a way to do this without user variables?

I was thinking about something like this: in my table, every row has a unique alphanumeric string id, which can look like "1234567890", or like "abcdef12345". I was thinking of somehow converting that string to a number, and then using the modulo function to only select 1/n rows from my table. However, I have no idea how to do the conversion, as this string is not hexadecimal.

Note: my table does not have an autoincremented column.

Shadow
  • 33,525
  • 10
  • 51
  • 64
Bluefire
  • 13,519
  • 24
  • 74
  • 118
  • selecting every nth row is not random. – Dan Bracuk Jul 21 '14 at 12:40
  • @DanBracuk True, but it's also completely unbiased and provides a stratified sample, which is what I want. – Bluefire Jul 21 '14 at 12:42
  • Is your alphanumeric string hexadecimal and completely contiguous? – Arth Jul 21 '14 at 12:42
  • @Arth no, sadly it's not hexadecimal. If, however, by contiguous you mean 'without spaces', then yes, it is. – Bluefire Jul 21 '14 at 12:44
  • How do you determine which alphanumeric string to give each value? – Arth Jul 21 '14 at 12:52
  • @Arth I'm not sure, but it's not autoincremented. – Bluefire Jul 21 '14 at 12:58
  • Just to clarify when you say 'without spaces' do you mean without space characters or that 'abcdef12345' will be followed by 'abcdef12346'? And which characters are permitted in your string? – Arth Jul 21 '14 at 13:05
  • @Arth I mean without space characters. As for the chars allowed, it's 0-9 and a-z (lower case). – Bluefire Jul 21 '14 at 13:25
  • Sorry, consecutive may have been a better word. If they are not consecutive, then how do you expect to get an integer value that increases by 1 for each row? Which is what you need for modulus in this case. Without an ability to generate a row number, I don't think this is possible. – Arth Jul 21 '14 at 13:29
  • @Arth I don't want an integer value that increases by 1 each row. Why would I need that? All I need is for 2 different strings to always be assigned 2 different numbers. – Bluefire Jul 21 '14 at 16:23
  • @Bluefire, to use modulo you'd need it.. otherwise you'd have no idea how many rows you were returning. That's how both the suggestions work below. – Arth Jul 21 '14 at 16:36
  • @Arth right, but I don't need a precise amount of rows; that was made obvious by my willingness to randomise the selection. If I convert each string to a number and filter by modulo *n*, then I will select *approximately* 1 / *n* of the entries, which is fine. – Bluefire Jul 21 '14 at 23:25
  • @Bluefire Oh ok, that wasn't obvious. – Arth Jul 22 '14 at 09:32

1 Answers1

3

This is complicated, but you can do it. It requires a self-join and aggregation, implemented in this query using a correlated subquery. My guess is that this will not perform well, because you presumably have a large table. For a 10% sample, it would look like:

select ht.*,
       (select count(*)
        from hugetable ht2
        where ht2.col < ht.col or
              (ht2.col = ht.col and ht2.id <= ht.id)
       ) as rn
from hugetable ht
having rn % 10 = 1;

Note that the use of having in this context is specific to MySQL. It allows you to filter the rows without using a subquery.

EDIT:

Probably the only feasible approach -- it you can do it -- is to create another table with an auto-incremented id. Here is a stripped down version:

create table temp (
    id int auto_increment,
    idstring varchar(255),
    col varchar(255)
);

insert into temp(idstring, col)
    select idstring, col
    from hugetable ht
    order by col;

select *
from temp
where id % 10 = 1;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Thanks! I'll try the first way, but the second one won't work because InfiniDB doesn't support constraints such as `AUTO_INCREMENT`. – Bluefire Jul 21 '14 at 16:24
  • @GordonLinoff Where are you getting `ht1` from in the 1st approach? Shouldn't this be `ht`? – Arth Jul 21 '14 at 16:40
  • @Arth . . . Let's call that a typo. It should be correlated with the outer alias, `ht`. – Gordon Linoff Jul 21 '14 at 19:22