I've been using Django Haystack for a while now and it's great! I have a fairly data-heavy site, and the search index needs to be refreshed regularly (every 15 to 30 minutes).

Running python manage.py update_index takes a long time because it reindexes everything. Is there a way to speed this up, or to update only the data that has changed?

I'm currently using Django Haystack 1.2.7 with Solr as backend and Django 1.4.

Thanks!!!


EDIT:

Yes, I've read that part of the documentation, but what I really need is a way to speed up indexing, for example by updating only recently changed data instead of everything. I've found get_updated_field but don't know how to use it. The documentation only explains what it is for and shows no real examples.


EDIT 2:

start = DateTimeField(model_attr='start', null=True, faceted=True, --HERE?--)

EDIT 3:

OK, I've implemented the solution below, but when I ran rebuild_index (with 45,000 records) it almost crashed my computer. After 10 minutes of waiting, this error appeared:

 File "manage.py", line 10, in <module>
    execute_from_command_line(sys.argv)
  File "/usr/local/lib/python2.7/dist-packages/django/core/management/__init__.py", line 443, in execute_from_command_line
    utility.execute()
  File "/usr/local/lib/python2.7/dist-packages/django/core/management/__init__.py", line 382, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/usr/local/lib/python2.7/dist-packages/django/core/management/base.py", line 196, in run_from_argv
    self.execute(*args, **options.__dict__)
  File "/usr/local/lib/python2.7/dist-packages/django/core/management/base.py", line 232, in execute
    output = self.handle(*args, **options)
  File "/usr/local/lib/python2.7/dist-packages/haystack/management/commands/rebuild_index.py", line 16, in handle
    call_command('update_index', **options)
  File "/usr/local/lib/python2.7/dist-packages/django/core/management/__init__.py", line 150, in call_command
    return klass.execute(*args, **defaults)
  File "/usr/local/lib/python2.7/dist-packages/django/core/management/base.py", line 232, in execute
    output = self.handle(*args, **options)
  File "/usr/local/lib/python2.7/dist-packages/haystack/management/commands/update_index.py", line 193, in handle
    return super(Command, self).handle(*apps, **options)
  File "/usr/local/lib/python2.7/dist-packages/django/core/management/base.py", line 304, in handle
    app_output = self.handle_app(app, **options)
  File "/usr/local/lib/python2.7/dist-packages/haystack/management/commands/update_index.py", line 229, in handle_app
    do_update(index, qs, start, end, total, self.verbosity)
  File "/usr/local/lib/python2.7/dist-packages/haystack/management/commands/update_index.py", line 109, in do_update
    index.backend.update(index, current_qs)
  File "/usr/local/lib/python2.7/dist-packages/haystack/backends/solr_backend.py", line 73, in update
    self.conn.add(docs, commit=commit, boost=index.get_field_weights())
  File "/usr/local/lib/python2.7/dist-packages/pysolr.py", line 686, in add
    m = ET.tostring(message, encoding='utf-8')
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1127, in tostring
    ElementTree(element).write(file, encoding, method=method)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 821, in write
    serialize(write, self._root, encoding, qnames, namespaces)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 940, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 940, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 915, in _serialize_xml
    write("<" + tag)
MemoryError
prototype

1 Answer

get_updated_field should return a string containing the name of the model attribute that holds the date the model was last updated (haystack docs). A DateTimeField with auto_now=True is ideal for that (Django docs), since it is set to the current time every time the object is saved.

For example, my UserProfile model has a field named updated:

models.py

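from django.contrib.auth.models import User
from django.db import models

class UserProfile(models.Model):
    user = models.ForeignKey(User)
    # lots of other fields snipped
    # auto_now=True sets this field to the current time on every save()
    updated = models.DateTimeField(auto_now=True)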

search_indexes.py

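from haystack.indexes import CharField, SearchIndex

from .models import UserProfile  # assumes models.py lives in the same app

class UserProfileIndex(SearchIndex):
    text = CharField(document=True, use_template=True)
    user = CharField(model_attr='user')
    user_fullname = CharField(model_attr='user__get_full_name')

    def get_model(self):
        return UserProfile

    def get_updated_field(self):
        # Name of the model attribute that update_index --age filters on
        return "updated"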

Then when I run ./manage.py update_index --age=10 it only indexes the user profiles updated in the last 10 hours.
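To make the effect of --age concrete, here is a minimal sketch in plain Python of the kind of filtering involved: compute a cutoff that many hours in the past and keep only records whose updated timestamp is at or after it. The function name updated_since and the dictionary records are hypothetical and only for illustration; Haystack applies the equivalent filter to the model's queryset using the field named by get_updated_field, and its exact implementation may differ.

```python
import datetime


def updated_since(records, age_hours, now=None):
    """Return the records whose 'updated' timestamp is within the last age_hours.

    `records` is a list of dicts with an 'updated' datetime (a stand-in for
    model instances); `now` can be passed explicitly to make the result
    reproducible.
    """
    now = now or datetime.datetime.now()
    cutoff = now - datetime.timedelta(hours=age_hours)
    # Keep only records touched at or after the cutoff.
    return [record for record in records if record["updated"] >= cutoff]
```

Anything older than the cutoff is skipped, which is why the updated field has to be refreshed every time an object changes.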

Stephen Paulger
  • Where should I add auto_now=True in the search_indexes.py? I've made an example in my question above. Also where exactly should I implement get_updated_field. Thanks for your answer!! – prototype Dec 12 '12 at 08:30
  • The auto_now would go on the Model in models.py, the function get_updated_field would go in the SearchIndex class. – Stephen Paulger Dec 16 '12 at 20:00
  • Your solution will work, I have no doubt about it, but when I ran rebuild_index an error appeared. – prototype Dec 18 '12 at 10:45
  • I have no doubt either as it works for me :) What does your error say? If you also pass `-v 2` to the update_index command you'll get more verbose output. – Stephen Paulger Dec 18 '12 at 14:05
  • Careful if your indexed model has references to other models whose data is part of the index. The last-updated field won't change when those related objects change, so the changes won't be reindexed (imagine category models on a main object). – dalore Aug 14 '13 at 16:08
  • Dalore's concern is solved by saving the main model object at the same time you save the related model object. – user2104778 Jan 04 '14 at 07:04
  • FYI, if you use Django's QuerySet method for bulk updates, `.update()`, the `auto_now` feature will not be honored, because `.update()` writes directly to the database without calling `.save()` or sending the save signals. This means the `--age` option above won't pick up models changed that way. To get around this you can loop over the queryset and call `.save()`, or, to keep using `.update()`, set the timestamp yourself, e.g. `updated=datetime.datetime.now()`. – mynameistechno Jun 21 '15 at 02:27
  • In addition to @mynameistechno's comment, if you use `.save(update_fields=['some_field'])` the `updated` field will **not** be written (!), because it is not in the list. Either use a plain `.save()`, include `updated` in `update_fields`, or use the `datetime` workaround mentioned above. – nik_m Jan 03 '17 at 16:38
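
The `.update()` workaround from the last two comments could be sketched as a small helper that adds the timestamp to whatever fields are being bulk-updated. The helper name with_updated_stamp is hypothetical, the field name updated follows the answer's model, and the queryset call in the trailing comment is an illustration only, not something run here. If the project uses time zone support, `django.utils.timezone.now()` would presumably be the better source for the timestamp.

```python
import datetime


def with_updated_stamp(changes, field_name="updated", now=None):
    """Return a copy of `changes` that also sets the timestamp field.

    Meant for QuerySet.update(**...), which bypasses auto_now.
    """
    stamped = dict(changes)
    # Do not override a timestamp the caller has set explicitly.
    stamped.setdefault(field_name, now or datetime.datetime.now())
    return stamped


# Hypothetical usage with the answer's model (assumes a field `some_field`):
# UserProfile.objects.filter(user__is_active=False).update(
#     **with_updated_stamp({"some_field": "value"}))
```

With the timestamp written alongside the other fields, rows changed through `.update()` fall inside the `--age` window like any saved object.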