Matching a substring against a stored list of strings

Question

So, after the recent Twitch leak everyone is discussing their favorite bits of bad code. One bit that stood out was a single monster SQL statement to spot "illegal terms". That got me thinking about how to implement this "correctly".

So you have a table of strings that you want to match sub-strings against, how you write this in PL/pgSQL? I am assuming any SQL implementation should have procedural capabilities to create functions/procedures for such meta-programming, basically creating and executing SQL like the following:

admin@localhost:words> SELECT 'XabcY' LIKE '%abc%' OR 'XabcY' LIKE '%xyz%' as matches;
+-----------+
| matches   |
|-----------|
| True      |
+-----------+

So to be more specific, given a list of strings in table disallowed:

| illegal_string |
|----------------|
| stupid         |
| witless        |
| moron          |
| commie-lover   |

How would you create a dynamic query in PL/pgSQL that when executed returned true if any of these matched a given string? It does not need to use ILIKE in the Twitch like to check if the given word contained the substring, so using position is fine too, but it should be performant/tuneable using gin indices and whatnot.

Related issues.

Pavel Stehule · Answer 1 · 2021-10-12T14:28:50.263

There are two possibilities:

you can use LIKE ANY(array) operator

postgres=# select 'Ahoj' like any (ARRAY['Ah%', 'Na%']);
┌──────────┐
│ ?column? │
╞══════════╡
│ t        │
└──────────┘
(1 row)

postgres=# select 'Nazdar' like any (ARRAY['Ah%', 'Na%']);
┌──────────┐
│ ?column? │
╞══════════╡
│ t        │
└──────────┘
(1 row)

you can use regular expression:

postgres=# select 'Ahoj' ~ '^(Ah|Na)';
┌──────────┐
│ ?column? │
╞══════════╡
│ t        │
└──────────┘
(1 row)

postgres=# select 'Nazdar' ~ '^(Ah|Na)';
┌──────────┐
│ ?column? │
╞══════════╡
│ t        │
└──────────┘
(1 row)

So in the end you don't need dynamic SQL.

There are ANSI/SQL syntax too:

postgres=# select 'Nazdar' similar to '(Ah|Na)%';
┌──────────┐
│ ?column? │
╞══════════╡
│ t        │
└──────────┘
(1 row)

So you can write something like:

DECLARE pw text[];
BEGIN
  pw := (SELECT array_agg('%' || disallowed || '%'
            FROM disallowed);
  IF EXISTS(SELECT * FROM foo WHERE c LIKE ANY (pw)) THEN
    RAISE NOTICE 'there are some disallowed words';
  END IF;
  ...

I am not sure about an performance. On larger table you need trigram index, or maybe better is using fulltext instead substring searching.

> So in the end you don't need dynamic SQL. Not sure I get you. How can I create the regex in the second option from a table? The regex needs to be built up from some kind of SELECT. With regards to the first option, I assume `ARRAY['Ah%', 'Na%']);` could be created from a `SELECT` on `illegal_words`. Not sure I get how the last procedural function works. What is `disallowed` referring to? — oligofren, Oct 12 '21 at 10:55
@oligofren - there was a bug in this example. I fixed it. you can create fulltext index over your table, and very similarly you can write a query for searching some broken words. https://www.postgresql.org/docs/current/textsearch.html — Pavel Stehule, Oct 12 '21 at 14:31
That might be, but could you expand version two to a full example? I do not understand how one would construct `select 'Ahoj' ~ '^(Ah|Na)'` from selecting on `illegal_words`, which is basically the original question :) — oligofren, Oct 12 '21 at 15:08
@oligofren - you can use subselect `select 'Ahoj' ~ (select string_agg(ilegal_words, '|') from disallowed) from ...` — Pavel Stehule, Oct 12 '21 at 17:22

Stefanov.sm · Answer 2 · 2021-10-14T19:09:13.027

I do not think that you need PL/pgSQL or dynamic SQL for this. Make a set of text from the table of disallowed words and match against the set items as regular expressions. Maybe not very performant using regexp magic, but I hope straightforward. The query below can of course be parameterized.

select '<text to examine>' ~* 
  any(select illegal_string from disallowed) as rude;

select 'You Stupido MORONE!' ~* 
  any(select illegal_string from disallowed) as rude;
-- yields true.

You may wish to restrict the search to whole words only. Then play with the regexp set and shape it like this:

select 'You Stupido MORONE!' ~* 
  any(select '\m'||illegal_string||'\M' from disallowed) as rude;
-- yields false

SQL Fiddle here

Matching a substring against a stored list of strings

2 Answers2