23

We have OCRed thousands of pages of newspaper articles. The newspaper, issue, date, page number and OCRed text of each page have been put into a MySQL database.

We now want to build a Google-like search engine in PHP to find the pages given a query. It's got to be fast, and take no more than a second for any search.

How should we do it?

lkessler
  • 3
    What makes Google different from plain text search engines is that it studies the relationships between pages. How would you relate your pages to each other? Links? Key words or phrases? If you do not have any kind of relationship between pages, you'd be better off with a text search. – Adam Pierce Feb 02 '09 at 05:14
  • 1
    Our database of 50,000 items takes mySQL about 20 seconds to do a plain text search. Our OCRed newspaper pages are a much larger dataset. We need faster Google-like methods of indexing and retrieval to search our newspapers in under a second. – lkessler Feb 02 '09 at 05:46
  • Search engines do not use SQL databases, as they make search slow. You can use Lucene or code your own search engine. PHP is not a suitable language for developing a search engine. – alienCoder Feb 20 '13 at 11:38
  • 3
    I asked this question over 5 years ago. It had 26,000 views and 17 upvotes. So now, 5 years later, you decide it's too broad and you put it on hold???? – lkessler Sep 05 '14 at 01:43
  • This should be open again, and discussed more! – Miron Jan 17 '18 at 09:02

9 Answers

15

You can also try out SphinxSearch. Craigslist uses Sphinx, and it can connect to both MySQL and PostgreSQL.

cnu
  • Hi, I have created a lot of web pages and I want to search for any word within them. Are these answers useful for my case? Thanks – pcs May 20 '15 at 11:12
10

There are some interesting search engines for you to take a look at. I don't know what you mean by "Google like" so I'm just going to ignore that part.

  • Take a look at the Lucene engine. The original is high performance but written in Java. There is a port of Lucene to PHP (already mentioned elsewhere) but it is too slow.
  • Take a serious look at the Xapian Project. It's fast. It's written in C++, so you'll most probably have to build it for your target server(s), but it has PHP bindings.
Glenn
10

If MySQL's full-text search is taking 20 seconds per query, you either have it misconfigured or are running it on underpowered hardware; some big sites are successfully using plain old MyISAM searching.

My vote goes for Solr, however. It's based on Lucene, so you get all the richness and performance of that best-of-breed product, but with a RESTful API, making it very easy to use from PHP. There's even a dW article.
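
As a rough illustration of what "RESTful" means here, a search is just an HTTP GET against Solr's `/select` handler. The sketch below is in Python for brevity (the same URL can be requested from PHP with any HTTP client) and only builds the URL; it sends nothing. The host, port, core name `newspapers`, and field name `ocr_text` are assumptions made up for this example, while `q`, `rows`, and `wt` are standard Solr query parameters.

```python
from urllib.parse import urlencode

def build_solr_query_url(base_url, core, text, rows=10):
    """Build a URL for Solr's /select handler; nothing is sent over the network."""
    params = {
        "q": f'ocr_text:"{text}"',  # hypothetical field holding the OCRed page text
        "rows": rows,               # maximum number of results to return
        "wt": "json",               # ask for a JSON response
    }
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"

# Hypothetical local Solr instance and core name.
url = build_solr_query_url("http://localhost:8983", "newspapers", "town hall fire")
print(url)
```

Fetching that URL would return a JSON document listing the matching pages, which the PHP side can decode and render.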

James Brady
  • 1
    I agree. Go with SOLR all the way. Integrated PHP and SOLR many times and it's worth the time. – Rafael Sanches Feb 21 '12 at 13:46
  • Yeah, 20 seconds for a MySQL full-text search indicates something is broken. It should take about 0.01 to 0.05 seconds total (SQL plus page render time) for full text on > 250,000 rows, even on a very low-end system (single core, 512 MB RAM). Even doing multiple LIKE statements, one per keyword, on a DB with about 250,000 rows shouldn't take more than a second. It sounds most likely that the columns are just not indexed at all. For < 250,000 rows, looping over all the matched results in PHP to rank them intelligently should still be sub-second. – Iain Collins Apr 17 '12 at 23:35
4

You could put all the files on Google Docs, then scrape the results to your own web site.

My concern is that OCR accuracy is still an issue, so one search requirement to consider is the ability to perform "fuzzy" searches. Fuzzy means that when the OCR incorrectly recognizes the word "hat" as "hot", the search engine is smart enough to return results that are similar but not exact. Oracle has a package called UTL_MATCH whose functions compare the similarity of two strings: http://docs.oracle.com/cd/E11882_01/appdev.112/e25788/u_match.htm#ARPLS352

A function like this would be useful.
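
To make the idea concrete, here is a minimal sketch of one common similarity measure, Levenshtein edit distance (the number of single-character insertions, deletions, and substitutions needed to turn one word into another). It is written in Python for illustration and is not Oracle's implementation; the helper `fuzzy_matches` and the sample vocabulary are made up for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or match at no cost
        prev = curr
    return prev[-1]

def fuzzy_matches(query, vocabulary, max_distance=1):
    """Return the words in vocabulary within max_distance edits of query."""
    return [w for w in vocabulary if edit_distance(query, w) <= max_distance]

# Hypothetical vocabulary of indexed words.
print(fuzzy_matches("hot", ["hat", "hut", "heat", "house"]))  # ['hat', 'hut']
```

Comparing a query against every indexed word is slow on a large vocabulary, so in practice this kind of check would usually run only over a short list of candidate terms.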

Sun
2

Your scenario suggests that you'd like to roll your own; good starting points for a general search engine would include:

If you want to use an off-the-shelf solution:

Silver Dragon
2

Why don't you try something like Google Search Appliance or Google Enterprise? It will have a cost associated with it, but it will save you from reinventing the wheel and give you "Google-like" search.

Pradeep
  • We would prefer to stick with PHP and mySQL because the database has cross purposes and needs to be integrated with the rest of our website. – lkessler Feb 02 '09 at 05:48
1

Check this Lucene port for PHP:

Christian C. Salvadó
1

You might want to check Sphider. In my experience it is quite fast and does the indexing automatically. It is also open source so you could take the code and modify it for your needs.

Darryl Hein
0

SQLite has quite good full-text search capability (look up SQLite FTS3/FTS4; it's surprisingly good).
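
A minimal sketch of SQLite full-text search, using Python's built-in `sqlite3` module for illustration (PHP can reach the same SQL through its own SQLite drivers). It assumes the underlying SQLite library was built with FTS3/FTS4 enabled; the table name, column names, and sample rows are invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# FTS4 virtual table; this statement fails if SQLite was built without FTS3/FTS4.
conn.execute("CREATE VIRTUAL TABLE pages USING fts4(newspaper, issue_date, body)")
conn.executemany(
    "INSERT INTO pages (newspaper, issue_date, body) VALUES (?, ?, ?)",
    [  # made-up sample rows
        ("Daily Herald", "1923-04-01", "Fire destroys the old town hall"),
        ("Daily Herald", "1923-04-02", "Council votes to rebuild the hall"),
        ("Evening Post", "1923-04-02", "Wheat prices rise again"),
    ],
)

def search_pages(query):
    """Return (newspaper, issue_date) for every page matching the full-text query."""
    rows = conn.execute(
        "SELECT newspaper, issue_date FROM pages WHERE pages MATCH ? ORDER BY docid",
        (query,),
    )
    return rows.fetchall()

print(search_pages("hall"))       # both Daily Herald pages
print(search_pages("town hall"))  # only the page containing both words
```

The `MATCH` operator consults the full-text index rather than scanning every row, which is what keeps lookups fast as the table grows.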

If you want a simple DIY approach in PHP, indexing with lots of small files, split by a hash of the terms being indexed, can work very well, and searching can be very fast even in PHP if you design it carefully. The idea is that a search on a term only needs to read one very small file containing the terms that match the hash and their record IDs (you could use bit-array slices to represent record IDs if you want to save disk space). Indexing every word for full-text search would be slow in PHP, though; that part should really be done in C.
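
One way to read the hash-bucket idea, sketched in Python for illustration: each term is hashed to one of a fixed number of small files, and a lookup opens only the file for that term's bucket. The bucket count, the JSON file format, the CRC32 hash, and all function names are assumptions of this sketch, not details from the answer.

```python
import json
import os
import re
import tempfile
import zlib

NUM_BUCKETS = 256  # assumed bucket count; more buckets means smaller files

def bucket_path(index_dir, term):
    """Map a term to its bucket file using a stable hash (CRC32)."""
    bucket = zlib.crc32(term.encode("utf-8")) % NUM_BUCKETS
    return os.path.join(index_dir, f"bucket_{bucket:03d}.json")

def build_index(index_dir, pages):
    """pages maps record id -> text. Writes one small JSON file per used bucket."""
    buckets = {}
    for record_id, text in pages.items():
        for term in set(re.findall(r"[a-z0-9]+", text.lower())):
            path = bucket_path(index_dir, term)
            buckets.setdefault(path, {}).setdefault(term, []).append(record_id)
    for path, postings in buckets.items():
        with open(path, "w", encoding="utf-8") as f:
            json.dump(postings, f)

def lookup(index_dir, term):
    """Read only the single bucket file that could contain the term."""
    term = term.lower()
    path = bucket_path(index_dir, term)
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as f:
        return sorted(json.load(f).get(term, []))

index_dir = tempfile.mkdtemp()
build_index(index_dir, {1: "Fire at the town hall", 2: "Town council meets", 3: "Wheat prices rise"})
print(lookup(index_dir, "town"))  # [1, 2]
```

A multi-word query would look up each term separately and intersect the resulting record ID lists.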

For "fuzzy" searches, maybe look at using metaphone hashes.
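
A phonetic hash maps similar-sounding words to the same short code, so the index can be keyed on the code instead of the exact spelling. PHP ships `metaphone()` and `soundex()` as built-in functions. The sketch below, in Python for illustration, implements the simpler Soundex scheme rather than Metaphone, purely to show the principle.

```python
SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}

def soundex(word):
    """Classic Soundex: first letter plus three digits, zero-padded."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = SOUNDEX_CODES.get(c, "")
        if digit and digit != prev:
            code += digit  # new consonant sound
        if c not in "hw":  # h and w do not separate letters with the same code
            prev = digit
    return (code + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(soundex("smith"), soundex("smyth"))    # S530 S530
```

Phonetic codes group words that sound alike, which is not always the same as words that OCR confuses visually, so this would likely complement an edit-distance check rather than replace it.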

For pre-built full-text tools, check out these: SQLite FTS3/FTS4 (SQLite has very good full-text search capability!), Sphinx, and KinoSearch (KinoSearch is a bit like Lucene, but the back end is C with a nice, easy Perl wrapper; there is also CLucene, but I think that's still pre-alpha).

Java Lucene (or anything Java-based) probably needs a lot of RAM set aside to run a JVM, so it is probably not so great if you are on a budget.

Michael MD