4

In my application, say, animals have many photos. I'm querying photos of animals such that I want all photos of all animals to be displayed. However, I want each animal to appear as a photo before repetition occurs.

Example: 

animal instance 1, 'cat', has four photos, 
animal instance 2, 'dog', has two photos:

photos should appear ordered as so:

#photo         belongs to     #animal

tiddles.jpg ,                  cat 
fido.jpg                       dog 
meow.jpg                       cat 
rover.jpg                      dog 
puss.jpg                       cat 
felix.jpg,                     cat  (no more dogs so two consecutive cats)
  • Pagination is required so I can't order on an array.
  • Filename structure/convention provides no help, though the animal_id exists on each photo.
  • Though there are two types of animal in this example this is an active record model with hundreds of records.
  • Animals may be selectively queried.

If this isn't possible with active_record then I'll happily use sql; I'm using postgresql.

My brain is frazzled so if anyone can come up with a better title, please go ahead and edit it or suggest in comments.

mark
  • 10,316
  • 6
  • 37
  • 58

6 Answers6

3

Here is a PostgreSQL specific solution:

batch_id_sql = "RANK() OVER (PARTITION BY animal_id ORDER BY id ASC)"

Photo.paginate(
  :select => "DISTINCT photos.*, (#{batch_id_sql}) batch_id",
  :order  => "batch_id ASC, photos.animal_id ASC",
  :page   => 1)

Here is a DB agnostic solution:

batch_id_sql = "
 SELECT COUNT(bm.*) 
 FROM   photos bm 
 WHERE  bm.animal_id = photos.animal_id AND
        bm.id <= photos.id 
"

Photo.paginate(
  :select => "photos.*, (#{batch_id_sql}) batch_id",
  :order  => "batch_id ASC, photos.animal_id ASC",
  :page   => 1)

Both queries work even when you have a where condition. Benchmark the query using expected data set to check if it meets the expected throughput and latency requirements.

Reference

PostgreSQL Window function

Harish Shetty
  • 64,083
  • 21
  • 152
  • 198
  • Oh this is great! The first seems not to work though I can't really understand why not; the order is random with early repetition of animals' photos. The second works perfectly though presents a new unrelated problem. Strangely, if I try to include (eager load) associated models then the dynamic batch_id field is reported not found. It's not a requirement of the question and you've been very helpful but if you could point your brain at this I'd be very grateful. – mark May 08 '11 at 10:44
  • I have tested both queries and they worked for me. Can you post the SQL statement from the error log on pastie(pastie.org) for me to verify? We probably should try to get the first version to work for you as it is more efficient than the second. – Harish Shetty May 08 '11 at 15:20
  • Just been looking at this and the first is in fact fine and as you say it's also 300% times quicker. The eager loading is a rails thing about includes causing selects to be ignored, I just need to work around that. – mark May 09 '11 at 18:00
  • Try adding a `DISTINCT` clause to the select statement to eliminate duplicates. – Harish Shetty May 09 '11 at 19:09
  • You're awesome! I really wasn't expecting you to answer that, thinking it was something I'd have to work around but here you are, helping me out again. You just can't help yourself, can you? Thank you so much for all your help. Tell your loved ones you did a good thing and ask them to hug you, for me. Salud! :) – mark May 09 '11 at 20:53
  • Oh if you wanna have a look at this in action, look at my profile. Hobby app I'm working on; not really animals though. – mark May 09 '11 at 20:56
  • It was an interesting problem. It made me think, I learnt that PostgreSQL supports statistical functions. So it was not a bad deal for me :-). Also, since I have spent so some time on this question it was important for me to find a workable solution. – Harish Shetty May 10 '11 at 01:52
1

New solution

Add an integer column called batch_id to the animals table.

class AddBatchIdToPhotos < ActiveRecord::Migration
  def self.up
    add_column    :photos,   :batch_id, :integer
    set_batch_id
    change_column :photos,   :batch_id, :integer, :nil => false
    add_index     :photos,   :batch_id
  end

  def self.down
    remove_column :photos,   :batch_id
  end

  def self.set_batch_id
    # set the batch id to existing rows
    # implement this
  end
end

Now add a before_create on the Photo model to set the batch id.

class Photo
  belongs_to     :animal
  before_create  :batch_photo_add
  after_update   :batch_photo_update
  after_destroy  :batch_photo_remove

private

  def batch_photo_add
    self.batch_id = next_batch_id_for_animal(animal_id)
    true
  end

  def batch_photo_update
    return true unless animal_id_changed?
    batch_photo_remove(batch_id, animal_id_was)
    batch_photo_add
  end

  def batch_photo_remove(b_id=batch_id, a_id=animal_id)
    Photo.update_all("batch_id = batch_id- 1", 
      ["animal_id = ? AND batch_id > ?", a_id, b_id])
    true
  end

  def next_batch_id_for_animal(a_id)
    (Photo.maximum(:batch_id, :conditions => {:animal_id => a_id}) || 0) + 1
  end
end

Now you can get the desired result by issuing simple paginate command

@animal_photos = Photo.paginate(:page => 1, :per_page => 10, 
                     :order => :batch_id)

How does this work?

Let's consider we have data set as given below:

id  Photo Description    Batch Id
1   Cat_photo_1          1
2   Cat_photo_2          2
3   Dog_photo_1          1
2   Cat_photo_3          3
4   Dog_photo_2          2
5   Lion_photo_1         1
6   Cat_photo_4          4

Now if we were to execute a query ordered by batch_id we get this

# batch 1 (cat, dog, lion)
Cat_photo_1
Dog_photo_1
Lion_photo_1

# batch 2 (cat, dog)
Cat_photo_2
Dog_photo_2

# batch 3,4 (cat)
Cat_photo_3
Cat_photo_4

The batch distribution is not random, the animals are filled from the top. The number of animals displayed in a page is governed by per_page parameter passed to paginate method (not the batch size).

Old solution

Have you tried this?

If you are using the will_paginate gem:

# assuming you want to order by animal name
animal_photos = Photo.paginate(:include => :animal, :page => 1, 
                  :order => "animals.name")

animal_photos.each do |animal_photo|
  puts animal_photo.file_name
  puts animal_photo.animal.name
end
Harish Shetty
  • 64,083
  • 21
  • 152
  • 198
  • Thanks for your answer. I'm not ordering by name but photos of instances of animal, turn based on each animal. I'm going to try and clarify the question. – mark May 06 '11 at 15:20
  • Are the type of the animal fixed(i.e. only cats and dogs) – Harish Shetty May 06 '11 at 20:19
  • No, there are hundreds of animals. – mark May 07 '11 at 08:03
  • Updated my solution, take a look. – Harish Shetty May 07 '11 at 10:03
  • Clever answer but I think it would cause problems with photos of queried sets of animals. I need to think about it some more but going to put a bounty on this later see what comes up. Thanks! – mark May 07 '11 at 19:52
  • What are problems you foresee? Cost of creating batch id during insert (most likely scenario) is very low. The final query meets the stated requirements, supports pagination and there is NO additional cost for this query. – Harish Shetty May 07 '11 at 22:05
  • It's likely to be sound and you probably know better than me but I still don't understand how this will work, for example a query might return a subset af animals say batched 1-10 of cats and 20-30 of dogs. I'm going to put a bounty on this, see what comes from it, which you might well end up receiving. – mark May 07 '11 at 23:46
  • I have added extra description to the answer. If you have `where` conditions on the query the distribution of the animals across batches is not uniform. Since this is a purpose specific optimization, you can further fine tune the solution to address the distribution issue. Performing the grouping on the result-set is very expensive in this case. I am curious to see other solutions. – Harish Shetty May 08 '11 at 02:04
1

Having no experience in activerecord. Using plain PostgreSQL I would try something like this:

Define a window function over all previous rows which counts how many time the current animal has appeared, then order by this count.

SELECT
   filename,
   animal_id,
   COUNT(*) OVER (PARTITION BY animal_id ORDER BY filename) AS cnt
FROM
   photos
ORDER BY
   cnt,
   animal_id,
   filename

Filtering on certain animal_id's will work. This will always order the same way. I don't know if you want something random in there, but it should be easily added.

ontrack
  • 2,963
  • 2
  • 15
  • 14
  • Thanks very much for your answer but it causes random early repetition; I don't know why at this point. I think you've probably influenced the direction of the approach taken in the solution by Kandada and wish I could split the bounty. – mark May 08 '11 at 10:47
  • Yeah, I don't blama Kandada for that. The repition may be caused because you're joining some other things with this? Maybe you should look at a simpler version of your resultset to detect where the doubles are coming from, and thén look at the ordering. – ontrack May 08 '11 at 11:50
  • I saw your partition answer after posting my new answer. I was reading PostgreSQL documentation and found out that they support Window functions(I knew MsSQL supported this). I must say I was pleasantly surprised. – Harish Shetty May 08 '11 at 15:16
1

I'd recommend something hybrid/corrected based on KandadaBoggu's input.

First off, the correct way to do it on paper is with row_number() over (partition by animal_id order by id). The suggested rank() will generate a global row number, but you want the one within its partition.

Using a window function is also the most flexible solution (in fact, the only solution) if you want to plan to change the sort order here and there.

Take note that this won't necessarily scale well, however, because in order to sort the results you'll need to:

  • fetch the whole result set that matches your criteria
  • sort the whole result set to create the partitions and obtain a rank_id
  • top-n sort/limit over the result set a second time to get them in their final order

The correct way to do this in practice, if your sort order is immutable, is to maintain a pre-calculated rank_id. KandadaBoggu's other suggestion points in the correct direction in this sense.

When it comes to deletes (and possibly updates, if you don't want them sorted by id), you may run into issues because you end up trading faster reads for slower writes. If deleting the cat with an index of 1 leads to updating the next 50k cats, you're going to be in trouble.

If you've very small sets, the overhead might be very acceptable (don't forget to index animal_id).

If not, there's a workaround if you find the order in which specific animals appear is irrelevant. It goes like this:

  1. Start a transaction.

  2. If the rank_id is going to change (i.e. insert or delete), obtain an advisory lock to ensure that two sessions can't impact the rank_id of the same animal class, e.g.:

    SELECT pg_try_advisory_lock('the_table'::regclass, the_animal_id);
    

    (Sleep for .05s if you don't obtain it.)

  3. On insert, find max(rank_id) for that animal_id. Assign it rank_id + 1. Then insert it.

    On delete, select the animal with the same animal_id and the largest rank_id. Delete your animal, and assign its old rank_id to the fetched animal (unless you were deleting the last one, of course).

  4. Release the advisory lock.

  5. Commit the work.

Note that the above will make good use of an index on (animal_id, rank_id) and can be done using plpgsql triggers:

create trigger "__animals_rank_id__ins"
before insert on animals
for each row execute procedure lock_animal_id_and_assign_rank_id();

create trigger "_00_animals_rank_id__ins"
after insert on animals
for each row execute procedure unlock_animal_id();

create trigger "__animals_rank_id__del"
before delete on animals
for each row execute procedure lock_animal_id();

create trigger "_00_animals_rank_id__del"
after delete on animals
for each row execute procedure reassign_rank_id_and_unlock_animal_id();

You can then create a multi-column index on your sort criteria if you're not joining all over them place, e.g. (rank_id, name). And you'll end up with a snappy site for reads and writes.

Denis de Bernardy
  • 75,850
  • 13
  • 131
  • 154
  • My original solution caters for deletes and updates also. Look at `batch_photo_remove` and `batch_photo_update` methods in my solution. – Harish Shetty May 08 '11 at 15:23
  • Sorry about that. My incompetence in Ruby played tricks on me. :-) Updating my answer accordingly. – Denis de Bernardy May 08 '11 at 15:27
  • Thank you so much for that amazing answer. To be honest most of it is over my head and I'd need to read up on it. I will refer back to it if I need to optimize performance. The app is fairly small and probably will manage ok. Thanks again! – mark May 09 '11 at 18:03
0

You should be able to get the pictures (or filenames, anyway) using ActiveRecord, ordered by name.

Then you can use Enumerable#group_by and Enumerable#zip to zip all the arrays together.

If you give me more information about how your filenames are really arranged (i.e., are they all for sure with an underscore before the number and a constant name before the underscore for each "type"? etc.), then I can give you an example. I'll write one up momentarily showing how you'd do it for your current example.

Platinum Azure
  • 45,269
  • 12
  • 110
  • 134
  • Thanks for your reply. I'm sorry but the filenames provided are a simplified example. Additionally, I will require pagination. Sorry I didn't include that in my question, it's late here. – mark May 05 '11 at 21:55
  • Tricky... Are you trying to retrieve the filenames, or the images themselves as well? – Platinum Azure May 05 '11 at 22:14
  • Well, the image model stores the path to the attached filename. I'm using paperclip. – mark May 05 '11 at 22:16
0

You could run two sorts and build one array as follows:

result1= The first of each animal type only. use the ruby "find" method for this search.

result2= All animals, sorted by group. Use "find" to again find the first occurrence of each animal and then use "drop" to remove those "first occurrences" from result2.

Then: markCustomResult = result1 + result2

Then: You can use willpaginate on markCustomResult

Perry Horwich
  • 2,798
  • 3
  • 23
  • 51