We are building a real-time search feature for institutions. The index is based on user-uploaded files (mostly Word, Excel, PDF, and PowerPoint documents, plus ASCII text files). The expected I/O load is only 10-20 IOPS, but it varies by date and could peak at 100 IOPS. The database is 4 months old and is approaching 10 GB.
For the real-time search server, I'm considering Solr/Lucene and possibly ElasticSearch. The challenge is how to index these files fast enough that the search server can query the index in real time.
I have found some similar questions on how to index .doc/.xls/.pdf files, but they do not mention how to ensure indexing performance:
- Search for keywords in Word documents and index them
- Index Word/PDF Documents From File System To SQL Server
- How to extract text from MS office documents in C#
- Using full-text search with PDF files in SQL Server 2005
So my question is: how do I build the index fast?
Any suggestions on the architecture? Should I focus on building fast infrastructure (e.g. RAID, SSD, more CPU, network bandwidth), or on the indexing tools and algorithms?
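To make the "tools and algorithms" side of the question concrete, here is a minimal sketch of one possible pipeline shape: extract text from files in parallel, then hand the documents to the search server in batches rather than one at a time. Everything in it is an assumption for illustration only. `extract_text` is a hypothetical stand-in that reads plain text (a real version would need a format-aware parser for Word/Excel/PDF/PowerPoint), and `send_batch` is a hypothetical callback standing in for whatever bulk-indexing call the chosen search server provides.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from pathlib import Path
from typing import Callable, Iterable, Iterator, List


def extract_text(path: Path) -> dict:
    """Hypothetical extractor: handles plain text only.

    A real pipeline would call a format-aware parser here.
    """
    body = path.read_text(encoding="utf-8", errors="replace")
    return {"id": str(path), "body": body}


def batched(items: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield lists of at most `size` documents."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def index_files(
    paths: Iterable[Path],
    send_batch: Callable[[List[dict]], None],  # hypothetical bulk-index hook
    workers: int = 4,
    batch_size: int = 100,
) -> int:
    """Extract in parallel, then pass documents to the indexer in batches."""
    total = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        docs = pool.map(extract_text, paths)  # results keep input order
        for batch in batched(docs, batch_size):
            send_batch(batch)
            total += len(batch)
    return total
```

With this shape, `workers` and `batch_size` are the two knobs to measure against hardware changes. Threads are used here only because the stand-in extractor is I/O-bound; if parsing turns out to be CPU-bound, a process pool may be the better fit.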