-2

I have a table containing historical data with 2 columns, both strings. For example:

text year_month
text1 2023-02
text2 2023-02
text3 2020-10
text1 2019-01
text2 2020-01

Based on this, for each month, I'd like to count the number of text strings that are duplicates of text strings that appeared in past months. So, for the above example, my output should be:

year_month duplicate_count
2023-02 2
2020-10 0
2019-01 0
2020-01 0

How can I do this in SQL?

I could start by joining the input table with itself on an equality condition on text and an inequality condition on year_month. But is there a more efficient direction?

For reference, Databricks SQL is what I'm using for this.

Thanks.

UPDATE: Below is what I was referring to by joining the table with itself. This query works for my data, assuming there are no duplicates within the same month, an assumption I can relax by adding some sort of row ID to the data. See this link for this query in action on a bit more complex version of the example above.

SELECT year_month, SUM(is_duplicate) as duplicate_count FROM

  -- Inner query that does self join
  -- It tags a row in the input table as duplicate or not
  (SELECT t1.year_month, t1.text, MAX(CASE WHEN t2.year_month IS NULL THEN 0 ELSE 1 END) AS is_duplicate
  FROM input_table t1
  LEFT JOIN input_table t2 ON t1.text = t2.text AND 
  t1.year_month > t2.year_month
  group by t1.year_month, t1.text) as T

GROUP BY year_month
  • The more efficient direction would be fixing your data and not storing a year and a month as a delimited string - store them as two integer columns or use a proper `date` data type. – Stu Jul 18 '23 at 16:03
  • Sure, I can do that. Still, how will approach the query to count duplicates? – Fijoy Vadakkumpadan Jul 18 '23 at 16:04
  • The approach I would use would be to count where qualifying rows `exists` - but the way this would work depends on inequality comparisons so fixing your data first would make sense if you can do it. – Stu Jul 18 '23 at 16:07
  • @JonasMetzler If you can't help, let someone else. And you missed my reference to what I tried. – Fijoy Vadakkumpadan Jul 18 '23 at 16:13
  • 1
    Well, I think I can help, but proving sufficient effort and explaining why your own attempt went wrong is a must have when asking a question on SO. The downvotes on your question show this. You said "I could start...", but that's very vague and you didn't show a concrete query. Maybe you are looking for something like this?: https://dbfiddle.uk/vbNKfUwQ You have very few sample data, I am not sure whether this would cover all your use cases. As said, you could edit and improve your question... – Jonas Metzler Jul 18 '23 at 16:19
  • @JonasMetzler Thanks for the effort and the dbfiddle website reference (which I wasn't aware existed). I have clarified the question with the exact query I meant earlier in my reference to self join. For the record, SO community is biased against harder questions, which defeats the purpose of SO. Easier questions get plenty of support, e.g.: https://stackoverflow.com/questions/18193365/count-of-non-null-columns-in-each-row. Harder questions get downvoted - "helpers" assume it's OP's fault. Harder questions are harder to pin down by definition. Sigh. – Fijoy Vadakkumpadan Jul 18 '23 at 18:05

2 Answers2

0
SELECT year_month, count(nullif(r, 1)) as duplicate_count
  FROM (SELECT text, year_month,
               rank() over (partition by text order by year_month) as r
          FROM input_table) x
 GROUP BY year_month
 ORDER BY year_month DESC;
Kombajn zbożowy
  • 8,755
  • 3
  • 28
  • 60
-1

With a small tweak Kombajn's answer will handle the edge case hinted at in your updated question -

assuming there are no duplicates within the same month

SELECT `year_month`, COUNT(DISTINCT IF(r > 1, `text`, NULL)) AS duplicate_count
FROM (
    SELECT `text`, `year_month`,
           RANK() OVER (PARTITION BY `text` ORDER BY `year_month`) AS r
    FROM input_table
) x
GROUP BY `year_month`
ORDER BY `year_month` DESC;

Note: year_month needs to be quoted in backticks as it is a reserved word (this may not be the case in Databricks SQL).

I think your attempt, with the self left join, can be simplified to remove the derived table:

SELECT t1.year_month, COUNT(DISTINCT t2.text) AS duplicate_count
FROM input_table t1
LEFT JOIN input_table t2 ON t1.text = t2.text AND t1.year_month > t2.year_month
GROUP BY t1.year_month
ORDER BY t1.year_month DESC;

Note: year_month does not need to be quoted in this query as it is being qualified with the table alias.

The self join will probably benefit from both PK (`year_month`, `text`) and IDX(`text`, `year_month`), whereas Kombajn's query just needs the PK.

Try both queries against your data and see how they perform. Use EXPLAIN and/or EXPLAIN ANALYZE to see how much work the server is having to do to produce the result sets.

Here is a db<>fiddle

user1191247
  • 10,808
  • 2
  • 22
  • 32
  • @ShadowWizardStrikesBack It was just responding to [OP's comment](https://stackoverflow.com/questions/76714593/how-to-count-duplicates-based-on-historical-data-in-sql#comment135250396_76714593). Do you agree with their comment? – user1191247 Jul 19 '23 at 15:24
  • 1
    I didn't see it before. But as annoying as it might be, proper response to such comment should be in the form of comment reply. Thanks for editing it out. – Shadow The GPT Wizard Jul 19 '23 at 16:57