1

If I have a string 'this is a goodish string'
And I have three records in the database

1, "good"
2, "fart"
3, "this"

is there a way to do a lookup for records that are substrings of my source..

I ask because I have a black_list table that contains entries that I never want to allow my user to store. Like if they wanted to store www.google.com, and I had "google" in my black list, I'd get a positive result from exists, and know to not let them..

I'll keep digging and I might post my answer here, but it's an interesting thing to figure out. I'd rather do it in the DB layer than in the ruby/rails layer.

baash05
  • 4,394
  • 11
  • 59
  • 97

2 Answers2

2

If you only want to match whole words, this is going to be considerably faster with a bigger table because it can use an index on entry (unlike your solution, where the predicate is not sargable):

Simple

Assuming white space as word separator (\s+).

SELECT *
FROM   black_list 
JOIN   regexp_split_to_table('the task description', '\s+') entry USING (entry)

Advanced

For more specific needs, you can tailor the words you extract:

Split words (as defined by the POSIX standard) at \m. The manual:

matches only at the beginning of a word

Remove leading, trailing or all characters you want to ignore. Since we split at the beginning of words, trailing characters may be enough.

Finally cast to lower case - assuming lower case entries in black_list:

Basic and fast

Split words and trim given trailing characters:

SELECT *
FROM   black_list 
JOIN  (
   SELECT rtrim(w, ' /\,.-') AS entry -- add more?
   FROM   regexp_split_to_table('Oh - my, a4b smurf-village in/out.', '\m') w
   ) w USING (entry);

Thorough

Split words, remove all non-word characters and convert to lower case:

SELECT *
FROM   black_list 
JOIN  (
   SELECT lower(regexp_replace(w, '\W+', '')) AS entry
   FROM   regexp_split_to_table('Oh - my, a4b smurf-village in/out.', '\m') w
   ) w USING (entry);

SQL Fiddle

Beyond

You can take it one step further and test for similarity to also catch typos. But you have to weigh alpha against beta error (incorrectly excluding legal words).

SELECT set_limit(0.9); -- high example value

SELECT *
FROM   black_list b
JOIN  (
   SELECT lower(regexp_replace(w, '\W+', '')) AS entry
   FROM   regexp_split_to_table('Oh - my, a4b smurf-village in/out.', '\m') w
   ) w ON b.entry % w.entry

The % operator requires the additional module pg_trgm. Details:

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • This is a great answer.. sadly I need to match smurf-village and smurfs with an entry of smurf. Happily my entries will be likely stay under a hundred. We're only trying to create a black list for really persistent offenders of a sit policy. Wack-a-mole, but we punish rather well. – baash05 Jan 29 '15 at 06:50
  • Oh.. +1, because you're totally right.. it would be faster for whole word entries. thanks – baash05 Jan 29 '15 at 06:51
0

It took far less time than I thought.. mostly because I just tried out different things.

select * from black_list 
where 'this is the task description' like concat('%', entry, '%')

For the rails dev's out there

 scope :part_of, ->(source) { where("? like concat('%', entry, '%')", source) }


 BlackList.part_of('this is a test fire').exists?
 BlackList.part_of(task.description).exists?
baash05
  • 4,394
  • 11
  • 59
  • 97