Here is a script that indexes an existing database. FWIW, Whoosh refers to that as "batch indexing".
This is a little rough, but it works:
#!/usr/bin/env python2
import os
import sys

from flask.ext.whooshalchemy import whoosh_index

from app import app                    # adjust to match your project layout
from models import YourModel as Model

# unbuffered stdout, so progress output shows up immediately
sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0)

atatime = 512  # rows to pull from the database per batch

with app.app_context():
    index = whoosh_index(app, Model)
    searchable = Model.__searchable__

    print 'counting rows...'
    total = int(Model.query.order_by(None).count())
    done = 0
    print 'total rows: {}'.format(total)

    writer = index.writer(limitmb=10000, procs=16, multisegment=True)
    for p in Model.query.yield_per(atatime):
        # getattr() also copes with deferred/lazy-loaded columns
        record = dict((s, getattr(p, s)) for s in searchable)
        record.update({'id': unicode(p.id)})  # id is mandatory, or whoosh won't work
        writer.add_document(**record)
        done += 1
        if done % atatime == 0:
            print '{}/{} ({}%)'.format(done, total, round((float(done) / total) * 100, 2)),

    print '{}/{} ({}%)'.format(done, total, round((float(done) / total) * 100, 2))
    writer.commit()
You may want to play with the parameters:

atatime - the number of records to pull from the database at once
limitmb - the maximum memory, in megabytes, each indexing process may use
procs - the number of cores to use in parallel (one way to pick this is sketched below)
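
If 16 processes doesn't match your hardware, one option (my suggestion, not part of the original script) is to derive procs from the machine's CPU count before creating the writer:

    import multiprocessing

    # hypothetical tuning: one indexing process per core
    procs = multiprocessing.cpu_count()
    writer = index.writer(limitmb=10000, procs=procs, multisegment=True)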
I used this to index around 360,000 records on an 8-core AWS instance. It took about 4 minutes, most of which was spent waiting for the (single-threaded) commit().
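
Once the script finishes, it's worth a quick sanity check. Flask-WhooshAlchemy exposes search through the model's query object; a minimal check might look like this (the search term is just a placeholder):

    with app.app_context():
        # whoosh_search is added to Model.query by Flask-WhooshAlchemy
        hits = Model.query.whoosh_search(u'some term').all()
        print '{} result(s)'.format(len(hits))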