I'm trying to clean up data in a PostgreSQL table, where some records have a large number of profanities in the email_address
column (the records in question have been entered by agitated users as a result of frustration due to a bug that has since been fixed):
┌───────────────────┐ │ email_address │ ├───────────────────┤ │ foo@go.bar.me.net │ │ foo@foo.com │ │ foo@example.com │ │ baz@example.com │ │ barred@qux.com │ └───────────────────┘
Desired query output
I'd like to build a query that annotates each row from the data table with a profanity score, and orders the records by the score, so that a human can go through the annotated data (presented in a web app) and take necessary action:
┌───────────────────┬───────┐ │ email_address │ score │ ├───────────────────┼───────┤ │ foo@foo.com │ 18 │ │ foo@go.bar.me.net │ 14 │ │ foo@example.com │ 9 │ │ baz@example.com │ 3 │ │ barred@qux.com │ 0 │ └───────────────────┴───────┘
Attempt #1
The approach I'm taking is to build a list of regular expressions (now I have 2 problems...) and scores, whereby very profane words will contribute a large profanity score if that word is found in the email_address column. My profanities
table looks something like this:
┌──────────────────┬───────┐ │ profanity_regexp │ score │ ├──────────────────┼───────┤ │ foo │ 9 │ │ bar(?!red) │ 5 │ │ baz │ 3 │ └──────────────────┴───────┘
LATERAL JOIN
I've found that I can use a LATERAL
join over the regexp_matches
function to extract all profanities from each email_address
(but records with no profanities are discarded):
SELECT
data.email_address,
array_agg(matches)
FROM
data,
profanities p,
LATERAL regexp_matches(data.email_address, p.posix_regexp, 'gi') matches
GROUP BY
data.email_address;
This produces the following result:
┌───────────────────┬───────────────────┐ │ email_address │ profanities_found │ ├───────────────────┼───────────────────┤ │ foo@foo.com │ {{foo},{foo}} │ │ foo@example.com │ {{foo}} │ │ foo@go.bar.me.net │ {{foo},{bar}} │ │ baz@example.com │ {{baz}} │ └───────────────────┴───────────────────┘
SUB-SELECT
I also figured out how to get an array of profanity score subtotals for each record with this SQL:
SELECT
data.email_address,
array(
SELECT score * (
SELECT COUNT(*)
FROM (SELECT
regexp_matches(data.email_address, p.posix_regexp, 'gi')
) matches
)
FROM profanities p
) prof
from data;
Which correctly yields all rows (including rows without profanities) as such:
┌───────────────────┬──────────┐ │ email_address │ prof │ ├───────────────────┼──────────┤ │ foo@go.bar.me.net │ {9,5,0} │ │ foo@foo.com │ {18,0,0} │ │ foo@example.com │ {9,0,0} │ │ baz@example.com │ {0,0,3} │ │ barred@qux.com │ {0,0,0} │ └───────────────────┴──────────┘
Problem
How do I sum the result of a lateral join to get the desired output?
Is there another strategy I can use to get the desired result?
I've posted a live code fiddle for this question at http://sqlfiddle.com/#!17/6685c/4