0
Manufacturer
==========================
id            name      
--------------------------
1             Company Inc.
2             Google Test.
3             3M (UNITY) USA. INC.
4             CE EE

Say, I have a string 'Google Test. 1257 SCS RANDOM 31233DD' and I want to find all rows in table manufacturer where ht name is part of the given string:

SELECT * FROM manufacturer
WHERE 'Google Test. 1257 SCS RANDOM 31233DD' ILIKE '%' || name || '%' 

Correctly returns:

id            name      
--------------------------
2             Google Test.

But when I do:

SELECT * FROM manufacturer
WHERE '3dad QTICE EEN ' ILIKE  '%' || name || '%'

it returns:

id            name      
--------------------------
4             CE EE

I don't want partial matches like this. The name shall not match in the middle of a word. I tried substring():

SELECT * from manufacturer
WHERE  SUBSTRING('Google Test. 1257 SCS RANDOM 31233DD' from name) != '';

But I get:

ERROR: invalid regular expression: quantifier operand invalid

Unfortunately I don't have the exact spec to go off since I am querying this off external db. But from what I have seen, column is varchar(256). All values are upper cased and use plain spaces. All start with either character or number and end with either number, char, or special character. Ex: 'CLEVLAND DRILL (GREEN)'. There are special characters in the value, such as ,.()&/

I am not really looking for efficiency as long as it doesn't take over 50ms to do one query.

As of right now, there are about 10000+ entries but it could def grow over time.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
Av0id
  • 33
  • 11
  • To get a precise answer, you have to give a precise definition of "word". Are words always separated by plain spaces? What about start / end? Dash? Other punctuation? Case sensitive? Numbers? Stemming? And then: is the table big and performance important? Then provide your version of Postgres, table definition, avg. string length, cardinality. – Erwin Brandstetter Apr 10 '19 at 23:34
  • @ErwinBrandstetter Hi sorry about that. I added the spec as best as I could from my observation. – Av0id Apr 11 '19 at 16:52

3 Answers3

2

One method with LIKE is to add spaces to the begining and end:

SELECT *
FROM db
WHERE ' ' || '3dad QTICE EEN ' || ' ' ILIKE  '% ' || manufacturer || ' %'

If you need more sophisticated matching, then you might need to use regular expressions with word boundaries.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
2

To solve this problem you really need to use regex, as adding a space either side of the string will not match at the beginning or end of the line. By using regex, we can check for that situation too. For example:

SELECT *
FROM manufacturer
WHERE 'Google Test. 1257 36700 SCS RANDOM WORD 31233DD' ~ ('(^| )' || name || '( |$)');

Output:

id  name
2   Google Test.

Query:

SELECT *
FROM manufacturer
WHERE '3dad QTICE EEN ' ~ ('(^| )' || name || '( |$)');

Output:

There are no results to be displayed.

Query:

SELECT *
FROM manufacturer
WHERE 'CE EE ' ~ ('(^| )' || name || '( |$)');

Output:

id  name
4   CE EE

Demo on dbfiddle

Update

Because the name values in the table can contain characters that have special meaning in a regex, they need to be escaped before the name is included into the regex. You can do this with REGEXP_REPLACE e.g.

REGEXP_REPLACE(name, '([\\.+*?[^\]$(){}=!<>|:\-#])', '\\\1', 'g')

So your query should be:

SELECT *
FROM manufacturer
WHERE 'Google Test. 1257 36700 SCS RANDOM WORD 31233DD' ~ ('(^| )' || REGEXP_REPLACE(name, '([\\.+*?[^\]$(){}=!<>|:\-#])', '\\\1', 'g') || '( |$)');

Updated demo

Nick
  • 138,499
  • 22
  • 57
  • 95
  • so I tried using regex and I get an error 'ERROR: invalid regular expression: quantifier operand invalid'. Could this be due to special char in the value? I added spec to my post. Thanks – Av0id Apr 11 '19 at 16:55
  • @Av0id I've updated my answer to deal with that situation – Nick Apr 11 '19 at 23:14
2

All the values start with either character or a number and end with either number, char, or special character. ... There are special characters in the value, such as ,.()&/.

I suggest the regular expression match operator ~. Carefully define boundaries and escape special characters in name:

Create once:

CREATE OR REPLACE FUNCTION f_regexp_escape(text)
  RETURNS text AS
$func$
SELECT regexp_replace($1, '([!$()*+.:<=>?[\\\]^{|}-])', '\\\1', 'g')
$func$  LANGUAGE sql IMMUTABLE;

Then:

SELECT * FROM manufacturer
WHERE  '3dad QTICE EEN ' ~ ('\m' || f_regexp_escape(name) || '( |$)')

How? Why?

\m .. beginning of a word. Works, since: values start with either character or number
( |$) .. a space or the end of the string. We need this since values: end with either number, char, or special character

The content of manufacturer.name is the core of the pattern. You want the literal meaning of all its characters, so strip any special meaning by escaping properly. This is true for LIKE (few special characters) as well as the regular expression match operator ~ (more special characters). Often overlooked and quite a pitfall. That got you (and the tricky definition of bounds). Read this!

And then use the function f_regexp_escape() as demonstrated. A name like:

3M (UNITY) USA. INC.

becomes:

3M \(UNITY\) USA\. INC\.

Might be convenient to store readily escaped patterns in table manufacturer, maybe as additional column. And maybe with added padding like this:

\m3M \(UNITY\) USA\. INC\.( |$)

Or generate the pattern on the fly like demonstrated.

This way name can be a single word or a whole phrase, and end with any characters. But start and end never match in the middle of a "word" on the other side.

There is an arsenal of other pattern matching tools in Postgres:

If your table is big, consider the full text search infrastructure with optimized indexes and phrase search capability:

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228