4

I'm trying to clean up data in a PostgreSQL table, where some records have a large number of profanities in the email_address column (the records in question have been entered by agitated users as a result of frustration due to a bug that has since been fixed):

    ┌───────────────────┐
    │   email_address   │
    ├───────────────────┤
    │ foo@go.bar.me.net │
    │ foo@foo.com       │
    │ foo@example.com   │
    │ baz@example.com   │
    │ barred@qux.com    │
    └───────────────────┘

Desired query output

I'd like to build a query that annotates each row from the data table with a profanity score, and orders the records by the score, so that a human can go through the annotated data (presented in a web app) and take necessary action:

    ┌───────────────────┬───────┐
    │ email_address     │ score │
    ├───────────────────┼───────┤
    │ foo@foo.com       │    18 │
    │ foo@go.bar.me.net │    14 │
    │ foo@example.com   │     9 │
    │ baz@example.com   │     3 │
    │ barred@qux.com    │     0 │
    └───────────────────┴───────┘

Attempt #1

The approach I'm taking is to build a list of regular expressions (now I have 2 problems...) and scores, whereby very profane words will contribute a large profanity score if that word is found in the email_address column. My profanities table looks something like this:

    ┌──────────────────┬───────┐
    │ profanity_regexp │ score │
    ├──────────────────┼───────┤
    │ foo              │     9 │
    │ bar(?!red)       │     5 │
    │ baz              │     3 │
    └──────────────────┴───────┘

LATERAL JOIN

I've found that I can use a LATERAL join over the regexp_matches function to extract all profanities from each email_address (but records with no profanities are discarded):

SELECT
    data.email_address,
    array_agg(matches)
FROM
    data,
    profanities p,
    LATERAL regexp_matches(data.email_address, p.posix_regexp, 'gi') matches
GROUP BY
    data.email_address;

This produces the following result:

    ┌───────────────────┬───────────────────┐
    │   email_address   │ profanities_found │
    ├───────────────────┼───────────────────┤
    │ foo@foo.com       │ {{foo},{foo}}     │
    │ foo@example.com   │ {{foo}}           │
    │ foo@go.bar.me.net │ {{foo},{bar}}     │
    │ baz@example.com   │ {{baz}}           │
    └───────────────────┴───────────────────┘

SUB-SELECT

I also figured out how to get an array of profanity score subtotals for each record with this SQL:

SELECT
    data.email_address,
    array(
        SELECT score * ( 
            SELECT COUNT(*)
            FROM (SELECT
                regexp_matches(data.email_address, p.posix_regexp, 'gi')
            ) matches
        )
        FROM profanities p
    ) prof
from data;

Which correctly yields all rows (including rows without profanities) as such:

    ┌───────────────────┬──────────┐
    │   email_address   │   prof   │
    ├───────────────────┼──────────┤
    │ foo@go.bar.me.net │ {9,5,0}  │
    │ foo@foo.com       │ {18,0,0} │
    │ foo@example.com   │ {9,0,0}  │
    │ baz@example.com   │ {0,0,3}  │
    │ barred@qux.com    │ {0,0,0}  │
    └───────────────────┴──────────┘

Problem

How do I sum the result of a lateral join to get the desired output?

Is there another strategy I can use to get the desired result?


I've posted a live code fiddle for this question at http://sqlfiddle.com/#!17/6685c/4

Tyson
  • 405
  • 4
  • 13

3 Answers3

1

Add another select to your query. The current query is fine but you just need to sum the array.

SELECT email_address,
(
    SELECT SUM(s)
    FROM
        UNNEST(prof.profanity_score_subtotals) s
) AS sum_prof FROM (
    SELECT
        data.email_address,
        array(
            SELECT score * ( 
                SELECT COUNT(*)
                FROM (SELECT
                    regexp_matches(data.email_address, p.profanity_regexp, 'gi')
                ) matches
            )
            FROM profanities p
        ) profanity_score_subtotals
    FROM data
) prof;
Tyson
  • 405
  • 4
  • 13
Petru Scurtu
  • 363
  • 6
  • 17
  • I'm having trouble getting this to work, though I understand using `UNNEST` is the key to your answer, combined with the sub-select query I posted in the question. My interpretation of your answer is at http://sqlfiddle.com/#!17/6685c/17 -- want to expand your answer? – Tyson Mar 07 '18 at 09:23
  • I took a look at the fiddle. Maybe I misunderstood your requirements but it seems to be working fine. The final query outputs exactly the desired output. – Petru Scurtu Mar 07 '18 at 09:53
  • Weird - I looked again and indeed it works. I've no idea why it wasn't working earlier. Thanks! – Tyson Mar 07 '18 at 14:58
1

For some reason postgres doesn't allow you to use set-returning functions as part of a where clause, so you need to do two lateral joins:

SELECT
    data.email_address,
    t.score
FROM
    data,
    LATERAL (
        SELECT
            coalesce(sum(s.score), 0) AS score
        FROM
            profanities,
            LATERAL (
                SELECT
                    profanities.score * array_length(
                        regexp_matches(
                            data.email_address,
                            profanities.profanity_regexp,
                            'gi'
                        ),
                        1
                    ) score
            ) s
    ) t;
Tyson
  • 405
  • 4
  • 13
daurnimator
  • 4,091
  • 18
  • 34
  • I like this answer better as it worked out of the box, and it uses lateral joins as asked in the title of the question. – Tyson Mar 07 '18 at 15:11
  • Turns out the extra `LATERAL` join isn't necessary, see my answer for a slightly faster solution. – Tyson Mar 08 '18 at 06:40
1

I had previously accepted the answer by @daurnimator, but then found the extra LATERAL join isn't necessary. Here's what I ended up using in my app:

SELECT
    data.email_address,
    (
        SELECT
            coalesce(sum(s.score), 0) AS score
        FROM
            profanities,
            LATERAL (
                SELECT
                    profanities.score * array_length(
                        regexp_matches(
                            data.email_address,
                            profanities.profanity_regexp,
                            'gi'
                        ),
                        1
                    ) score
            ) s
    ) AS score
FROM
    data;

It also turns out my version is slightly faster since it avoids a nested loop in the query. Another advantage is that it can be used as an annotation with Django's RawSQL function in my application, allowing me to then order_by('-score') and show the most profane entries first.

Tyson
  • 405
  • 4
  • 13
  • 1
    Usually you'd want to use the score as an additional filter; e.g. only return email addresses where the score is at least 50. With my approach you're able to use that in the `where` clause. Of course you could move your query to a subselect..... – daurnimator Mar 08 '18 at 09:54