I have a table containing historical data with 2 columns, both strings. For example:
text | year_month |
---|---|
text1 | 2023-02 |
text2 | 2023-02 |
text3 | 2020-10 |
text1 | 2019-01 |
text2 | 2020-01 |
Based on this, for each month, I'd like to count the number of text strings that are duplicates of text strings that appeared in past months. So, for the above example, my output should be:
year_month | duplicate_count |
---|---|
2023-02 | 2 |
2020-10 | 0 |
2019-01 | 0 |
2020-01 | 0 |
How can I do this in SQL?
I could start by joining the input table with itself on an equality condition on text and an inequality condition on year_month. But is there a more efficient direction?
For reference, Databricks SQL is what I'm using for this.
Thanks.
UPDATE: Below is what I was referring to by joining the table with itself. This query works for my data, assuming there are no duplicates within the same month, an assumption I can relax by adding some sort of row ID to the data. See this link for this query in action on a bit more complex version of the example above.
SELECT year_month, SUM(is_duplicate) as duplicate_count FROM
-- Inner query that does self join
-- It tags a row in the input table as duplicate or not
(SELECT t1.year_month, t1.text, MAX(CASE WHEN t2.year_month IS NULL THEN 0 ELSE 1 END) AS is_duplicate
FROM input_table t1
LEFT JOIN input_table t2 ON t1.text = t2.text AND
t1.year_month > t2.year_month
group by t1.year_month, t1.text) as T
GROUP BY year_month