
Afternoon chaps,

Trying to index a 1.7 million row table with the Zend port of Lucene. On small tests of a few thousand rows it's worked perfectly, but as soon as I up the rows to a few tens of thousands, it times out. Obviously I could increase the time PHP allows the script to run, but seeing as 360 seconds gets me ~10,000 rows, I'd hate to think how long it would take to do 1.7 million.

I've also tried making the script run a few thousand, refresh, and then run the next few thousand, but doing this clears the index each time.

Any ideas guys?

Thanks :)

Tom
  • A few tens of thousands of rows is still microscopic; if you're timing out, something really wrong is going on. Please post your SQL and maybe even your server info (CPU, RAM). – TravisO Apr 14 '10 at 15:34

3 Answers


I'm sorry to say it, because the developer of Zend_Search_Lucene is a friend and he has worked really hard on it, but unfortunately it's not suitable for creating indexes on data sets of any nontrivial size.

Use Apache Solr to create your indexes instead; in my tests, Solr built them more than 300x faster than Zend_Search_Lucene.

You could use Zend_Search_Lucene to issue queries against the index you created with Apache Solr.

Of course you could also use the PHP PECL Solr extension, which I would recommend.
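For instance, indexing through the PECL extension looks roughly like this (a sketch; the hostname, port, path and field names are assumptions about your Solr setup):

$client = new SolrClient(array(
    'hostname' => 'localhost', // assumed local Solr instance
    'port'     => 8983,
    'path'     => '/solr',
));

$doc = new SolrInputDocument();
$doc->addField('id', '12345');
$doc->addField('company', 'Acme Ltd');

$client->addDocument($doc);
$client->commit(); // make the new document searchable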

Bill Karwin
  • Bill, thanks for the heads up. I've moved over to using Solr now and have got 1.7 million rows indexing in roughly 8 minutes; doesn't that make you feel warm inside! I have run into one problem though... http://stackoverflow.com/questions/2668279/using-solr-and-zends-lucene-port-together but it was definitely the right move, thank you! – Tom Apr 20 '10 at 14:20

Try speeding it up by selecting only the fields you require from that table.

If this is something to run as a cronjob or a worker, then it must be running from the CLI, and in that case I don't see why raising the timeout would be a bad thing. You only have to build the index once; after that, new records or updates to existing ones are only small changes to your Lucene database.
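For example (a sketch; "indexer.php" is an illustrative name), you can lift the limit entirely for one run with php -d max_execution_time=0 indexer.php, or from inside the script itself:

set_time_limit(0); // remove the execution-time limit (the CLI SAPI defaults to 0 anyway)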

Htbaa
  • Thanks for the quick reply Htbaa :) I've just got rid of 2 fields, so it's now only 3 (id, company name and postcode). I'm just running some tests and I'll see how that improves things. Ramping up the run time and letting it take as long as it takes is an option, although it's a pain for testing ;) I need to look into updating Lucene indexes, as I was unaware that you actually could. Thanks again! – Tom Apr 14 '10 at 15:05
  • See http://framework.zend.com/manual/en/zend.search.lucene.index-creation.html for updating documents and the index. – Htbaa Apr 14 '10 at 19:38
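For the curious, updating an existing Zend_Search_Lucene index boils down to deleting the stale document and adding a fresh one. A minimal sketch, assuming the id field was indexed as a Keyword (an UnIndexed field can't be searched on), and noting that Zend_Search_Lucene::open() appends to an existing index while create() wipes it:

$index = Zend_Search_Lucene::open('/path/to/index'); // open(), not create()

// Remove the stale document; $hit->id is Lucene's internal document id
foreach ($index->find('id:12345') as $hit) {
    $index->delete($hit->id);
}

// Add the replacement version
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Keyword('id', '12345'));
$doc->addField(Zend_Search_Lucene_Field::Text('company', 'New Company Name'));
$index->addDocument($doc);
$index->commit();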

Some info for you all - posting as an answer so I can use the code styles.

$sql = "SELECT id, company, psearch FROM businesses";
$result = $db->query($sql);     // Run SQL

$feeds = array();

$x = 0;
while ( $record = $result->fetch_assoc() ) {
    $feeds[$x]['id'] = $record['id'];
    $feeds[$x]['company'] = $record['company'];
    $feeds[$x]['psearch'] = $record['psearch'];
    $x++;   
}

//grab each feed

foreach($feeds as $feed) {  
  $doc = new Zend_Search_Lucene_Document();  

    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('id',  
    $feed["id"]));  

  $doc->addField(Zend_Search_Lucene_Field::Text('company',  
    $feed["company"]));  

    $doc->addField(Zend_Search_Lucene_Field::Text('psearch',  
    $feed["psearch"]));  

    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('link',  
    'http://www.google.com'));  


  //echo "Adding: ". $feed["company"] ."-".$feed['pcode']."\n";  

  $index->addDocument($doc);  
}  



$index->commit();

(I've used google.com as a temp link)

The server it's running on is a local install of Ubuntu 8.10 with 3 GB RAM and a dual Pentium 3.2 GHz chip.

Tom
  • Why are you double looping? Seems like you could get by with a single loop just fine. – Glen Solsberry Apr 16 '10 at 12:42
  • I'd like to share some performance tips with you: 1) you are running two loops, each over 1.7 million records; 2) Zend_Db_Table_Abstract::setDefaultMetadataCache($cache) will notably speed up your querying. – tawfekov Apr 16 '10 at 12:56
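Following up on the two comments above, a sketch of a single-loop rework (the index path and buffer size are illustrative): stream each row straight into the index instead of buffering all 1.7 million rows in $feeds first, and let Zend_Search_Lucene batch its own disk writes:

$index = Zend_Search_Lucene::create('/path/to/index'); // illustrative path
$index->setMaxBufferedDocs(500); // write buffered documents to disk in batches

$result = $db->query("SELECT id, company, psearch FROM businesses");
while ($record = $result->fetch_assoc()) {
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('id', $record['id']));
    $doc->addField(Zend_Search_Lucene_Field::Text('company', $record['company']));
    $doc->addField(Zend_Search_Lucene_Field::Text('psearch', $record['psearch']));
    $index->addDocument($doc);
}
$index->commit();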