
Rails 4, Mongoid instead of ActiveRecord (but this shouldn't change anything for the sake of the question).

Let's say I have a MyModel domain class with some validation rules:

class MyModel
  include Mongoid::Document

  field :text, type: String
  field :type, type: String

  belongs_to :parent

  validates :text, presence: true
  validates :type, inclusion: %w(A B C)
  validates_uniqueness_of :text, scope: :parent # important validation rule for the purpose of the question
end

where Parent is another domain class:

class Parent
  include Mongoid::Document

  field :name, type: String

  has_many :my_models
end

Also, I have the related collections in the database already populated with some valid data.

Now, I want to import some data from a CSV file, which can conflict with the existing data in the database. The easy thing to do is to create an instance of MyModel for every row in the CSV, verify it's valid, then save it to the database (or discard it).

Something like this:

csv_rows.each do |data| # simplified
  my_model = MyModel.new(data) # data is the hash with the values taken from the CSV row

  if my_model.valid?
    my_model.save validate: false
  else
    # do something useful, but not interesting for the question's purpose
    # just know that I need to separate validation from saving
  end
end

Now, this works pretty smoothly for a limited amount of data. But when the CSV contains hundreds of thousands of rows, this gets quite slow, because (worst case) there's a write operation for every row.

What I'd like to do, is to store the list of valid items and save them all at the end of the file parsing process. So, nothing complicated:

valids = []
csv_rows.each do |data|
  my_model = MyModel.new(data)

  if my_model.valid? # THE INTERESTING LINE: this check only hits the database; what happens if the row conflicts with other my_models that aren't saved yet?
    valids << my_model
  else
    # ...
  end
end

if valids.size > 0
  # bulk insert of all data
end

That would be perfect if I could be sure the CSV contains no duplicated rows and no data that violates MyModel's validation rules.


My question is: how can I check each row against the database AND the valids array, without having to repeat the validation rules defined in MyModel (i.e., without duplicating them)?

Is there a different (more efficient) approach I'm not considering?

lucke84
  • Check validations without ActiveRecord, as reference: http://stackoverflow.com/questions/3563087/rails-validation-without-model and http://stackoverflow.com/questions/9816866/ruby-on-rails-how-to-validate-a-model-without-active-record – rlecaro2 Jan 27 '14 at 13:42
  • This is interesting, but does not seem to be quite what I'm looking for. This would be OK if I had to test only validation rules like presence or inclusion, but what about the rule that states that an instance of `MyModel` should be unique within the scope of its `Parent`? – lucke84 Jan 27 '14 at 13:54

1 Answer


What you can do is validate each model, grab its attributes hash and push it to the valids array, then do a bulk insert of the values using MongoDB's insert:

valids = []
csv_rows.each do |data|
  my_model = MyModel.new(data)

  if my_model.valid?
    valids << my_model.attributes
  end
end

MyModel.collection.insert(valids, continue_on_error: true)
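
As a note: with continue_on_error: true, the driver keeps inserting the remaining documents when one of them fails (e.g., on a duplicate key error) instead of aborting the whole batch; see the driver docs linked below.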

This won't, however, prevent NEW duplicates coming from the CSV itself... for that you could do something like the following, using a hash with a compound key:

valids = {}
csv_rows.each do |data|
  my_model = MyModel.new(data)

  if my_model.valid?
    valids["#{my_model.text}_#{my_model.parent_id}"] = my_model.attributes
  end
end
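
A small variant of the same idea, in case you also want to keep track of rows skipped as in-file duplicates (the duplicates array is hypothetical, purely for illustration; here the first occurrence wins, whereas overwriting the hash key as above keeps the last):

valids = {}
duplicates = []

csv_rows.each do |data|
  my_model = MyModel.new(data)
  next unless my_model.valid?

  key = "#{my_model.text}_#{my_model.parent_id}"
  if valids.key?(key)
    duplicates << data # an in-file duplicate; first occurrence wins
  else
    valids[key] = my_model.attributes
  end
end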

Then either of the following will work. DB-agnostic:

MyModel.create(valids.values)

Or MongoDB'ish:

MyModel.collection.insert(valids.values, continue_on_error: true)

OR EVEN BETTER

Ensure you have a unique index on the collection:

class MyModel
  ...
  index({ text: 1, parent: 1 }, { unique: true, dropDups: true })
  ...
end
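
Note: declaring the index in the model isn't enough by itself; it still has to be created on the MongoDB collection, e.g. with Mongoid's rake db:mongoid:create_indexes task.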

Then just do the following:

MyModel.collection.insert(csv_rows, continue_on_error: true)

http://api.mongodb.org/ruby/current/Mongo/Collection.html#insert-instance_method
http://mongoid.org/en/mongoid/docs/indexing.html

TIP: If you anticipate thousands of rows, I recommend doing the insert in batches of 500 or so.
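
As a minimal sketch of that batching, reusing the valids hash from the second option (500 is just the suggested starting point, not a hard rule):

valids.values.each_slice(500) do |batch|
  MyModel.collection.insert(batch, continue_on_error: true)
end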

omarvelous
  • All this sounds extremely good. AFAYK, is there a way to keep track of which items of `MyModel` have been skipped because of an error? – lucke84 Jan 28 '14 at 10:28
  • Yup, by adding the following argument, `collect_on_error: true`, it will return the errors: `:collect_on_error (Boolean) — default: +false+ — if true, then collects invalid documents as an array. Note that this option changes the result format.` - Via the doc http://api.mongodb.org/ruby/current/Mongo/Collection.html#insert-instance_method – omarvelous Jan 28 '14 at 15:58
  • The approach is interesting, even if a bit too tied to the database choice. If I changed the database, I would need to fix this. Keeping this in mind, it could be a nice solution. I'd wait a bit longer to see if someone comes up with a more application-level solution; otherwise I'll accept your answer. Thanks mate :) – lucke84 Jan 28 '14 at 16:15
  • That is true! The thing with Validation and Bulk inserts, I've found that I've always had to go down to the database layer to get the most optimal performance. You have to then ask yourself.... Is uniqueness THAT important? Especially when you can do things like "group", or limit 1 ordered by timestamp (or whatever other ways, etc.)? But I'd be curious to see other answers as well. I agree with you 100% that a more de-coupled approach would be THE best. – omarvelous Jan 28 '14 at 16:22
  • Rethought my answer based on your comments. Updated 2nd option to be more DB agnostic... Using `attributes` method instead of `as_document` and using `Model.create` as alternative. – omarvelous Jan 28 '14 at 16:27
  • I know, right? Unfortunately it's a constraint that we need to keep there. – lucke84 Jan 28 '14 at 16:28