We are building a real-time search feature for institutions. The index is based on user-uploaded files (mostly Word, Excel, PDF, and PowerPoint documents, plus ASCII text files). The expected I/O load is only 10-20 IOPS, but it varies by date and could peak at 100 IOPS. The database is 4 months old and is approaching 10 GB.

For the real-time search server, I'm considering Solr/Lucene and possibly Elasticsearch. The challenge is how to index these files FAST, so that the search server can query the index in real time.

I have found some similar questions on how to index .doc/.xls/.pdf files, but none of them mention how to ensure indexing performance.

So my question is: how do I build the index FAST?

Any suggestions on the architecture? Should I focus on building fast infrastructure (e.g. RAID, SSD, more CPU, network bandwidth) or on the indexing tools and algorithms?

Dio Phung

1 Answer

We're building a high-performance full-text search for office documents. We can share some insights:

Hope it helps!

Ilia P