100-1000+ Term Search For Lucene / Sphinx / Mysql

Question

I'm currently building a database that will monitor company records. I need to search for MANY (we're talking potentially a few thousand when it's finished) company names against a single column in a table where the company names will be listed. I currently use the following basic mysql LIKE query (this is condensed, the current one has 300+ terms being searched for):

SELECT * FROM "case-file-owner" WHERE "party-name" LIKE 'Nike%' OR
"party-name" LIKE 'Lyon Group Inc.%' OR "party-name" LIKE 'Target Home%'
ORDER BY "party-name" ASC

As you can see, it's a bit basic! How would you do a search like this with Lucene or Sphinx? Are these tools overkill for this task? While I need to search for potentially thousands of companies, the current setup only searches a table containing maybe 1000 records each day (I run a query that cuts out records I don't need to search).

My primary issue revolves around minor issues. For example: Company1, Inc. Company1 Inc. The Company1 Inc. The Company1 Incorporated. are all possible representations of many company names I need to find.

Would MATCH work? Would a gigantic query like this take forever with Lucene or Sphinx? My current LIKE search is "optimized" by dumping any leading wildcard operators... but hardly intelligent!

Eric J. · Answer 1 · 2015-11-17T01:02:44.947

1

My primary issue revolves around minor issues. For example: Company1, Inc. Company1 Inc. The Company1 Inc. The Company1 Incorporated. are all possible representations of many company names I need to find.

In general, you will run into problems trying to pattern match variable manifestations of company names. There's a very real chance that you will produce false positives. For example 'Nike%' might mean to search for variants related to Nike Shoes, but would also match Nikeno Inc (if there were such a company).

If possible, I would suggest adding a column normalized-party-name where you store a single search term that represents each company, e.g.

normalized-party-name  party-name
COMPANY INC            Company1, Inc. 
COMPANY INC            Company1 Inc. 
COMPANY INC            The Company1 Inc.

The biggest issue however is the sleight variations like commas, periods, etc. I can create multiple company name variations fairly easily

You can remove all punctuation from the normalized name. You might also try using Sphinx with the Soundex morphology. Soundex ignores punctuation when looking for a match. You could then use the extended matching mode to search just the unique variants you get ignoring punctuation. Lucerne may have something similar, but I have not worked with it.

edited Nov 17 '15 at 01:02

answered Nov 16 '15 at 23:53

Eric J.

147,927
63
340
553

I can't do this. I do not control data entry, these are public records :( – photocode Nov 16 '15 at 23:54
Can you then create a table on your end that maps a normalized name to the public-record name and join to your own table when doing this type of filtering? – Eric J. Nov 17 '15 at 00:07
I suppose I could add my own table - I've rebuilt their database locally. The biggest issue however is the sleight variations like commas, periods, etc. I can create multiple company name variations fairly easily for the saved queries. I suppose running a command to remove these characters would be a start. How would I do that for multiple rows? I assume an actual editing command followed by a search would work faster than trying to manipulate a string, then search, over and over. As for Lucene, wouldn't that be easier? I feel like it should be a better option. – photocode Nov 17 '15 at 00:25
I'm gonna look into sphinx more for this. One issue I see though is mapping the non-normal name to the normal name. It would need to be a manual thing built beforehand, correct? – photocode Nov 17 '15 at 01:43
For anyone reading this, I found a cool function that sanitizes strings of all non-alphanumeric characters: http://stackoverflow.com/questions/6942973/mysql-how-to-remove-all-non-alpha-numeric-characters-from-a-string – photocode Nov 17 '15 at 01:44

100-1000+ Term Search For Lucene / Sphinx / Mysql

1 Answers1