SQL calculate item frequency using multiple / dependent columns?

Question

I'm completely new to SQL, and have read StackOverflow posts on SQL to try and figure this out, and other sources and unable to do this in SQL. Here goes...

I have a table of 3 columns and thousands of rows, with data for first 2 columns. The third column is currently empty and I need to populate the third column based on data already in the first and second columns.

Say I have states in the first column and fruit entries in the second column. I need to write an SQL statement(s) that calculates the number of different states where each fruit comes from, and then inserts this popularity number into the third column for every row. A popularity number of 1 in that row means that fruit only comes from one state, a popularity number of 4 means the fruit comes from 4 states. So my table is currently like:

state     fruit     popularity

hawaii    apple     
hawaii    apple     
hawaii    banana       
hawaii    kiwi      
hawaii    kiwi      
hawaii    mango        
florida   apple      
florida   apple        
florida   apple        
florida   orange      
michigan  apple     
michigan  apple     
michigan  apricot   
michigan  orange    
michigan  pear      
michigan  pear      
michigan  pear      
texas     apple     
texas     banana    
texas     banana    
texas     banana    
texas     grape

And I need to figure out how to calculate and then update the third column, named popularity, which is the number of states that exports that fruit. The goal is to produce (sorry bad pun) the table below, where based on above table, "apple" appears in all 4 states, oranges and banana appear in 2 states, and kiwi, mango, pear, and grape only appear in 1 state, hence their corresponding popularity numbers.

state     fruit     popularity

hawaii    apple     4
hawaii    apple     4
hawaii    banana    2   
hawaii    kiwi      1
hawaii    kiwi      1
hawaii    mango     1   
florida   apple     4 
florida   apple     4   
florida   apple     4   
florida   orange    2  
michigan  apple     4
michigan  apple     4
michigan  apricot   1
michigan  orange    2
michigan  pear      1
michigan  pear      1
michigan  pear      1
texas     apple     4
texas     banana    2
texas     banana    2
texas     banana    2
texas     grape     1

My small programmer brain says to try and figure out a way to loop through the data in some kind of script, but reading up a little on SQL and databases, it seems like you don't write long and slow looping scripts in SQL. I'm not even sure if you can? but instead that there are better/faster ways to do this in SQL.

Anyone know how to, in SQL statement(s), calculate and update the third column for each row, which is here called popularity and corresponds to the number of states that each fruit comes from? Thanks for reading, very grateful for any help.

So far I have tried these SQL statements below, which output but don't quite get me what I need:

--outputs those fruits appearing multiple times in the table
SELECT fruit, COUNT(*)
  FROM table 
 GROUP BY fruit
HAVING COUNT(*) > 1
 ORDER BY COUNT(*) DESC

--outputs those fruits appearing only once in the table
SELECT fruit, COUNT(*)
  FROM table 
 GROUP BY fruit
HAVING COUNT(*) = 1

--outputs list of unique fruits in the table
SELECT COUNT (DISTINCT(fruit))
  FROM table

This is a really bad data model. Aggregations (such as POPULARITY in your example) should either by calculated on demand or else held in separarate data structures (such as Materialized Views) if the data volumes and performance requirements deem that necessary. — APC, Aug 01 '12 at 14:45
To understand why it's a bad data model consider this scenario. You delete one row {hawaii, kiwi}. At the same time in anotehr session I insert two rows {minnesota, kiwi} and {new york, kiwi}. What is the correct value of POPULARITY for kiwi? And what will happen when we both attempt to update all those other rows where `FRUIT = 'kiwi'`? — APC, Aug 01 '12 at 14:50
@yaroslav - the use of CASE in SQL statements is a matter of personal taste (or organizational standards). The code as originally presented was perfectly valid. — APC, Aug 01 '12 at 14:58
@APC, I do not understand, wich CASE you are talking about? I do not see any CASE on the original code, I have double checked just in case. When I edit code I do not delete anything, just improve formatting — Yaroslav, Aug 01 '12 at 15:04
@yaroslav - sorry I meant UPPER CASE. You capiltalised all the keywords, when lower case is just fine (and more readable). — APC, Aug 01 '12 at 15:58
@APC this can be, and has been, discussed many many times on every tSQL forums, blog, etc. On my opinion, and some others, capitalizing keywords is more readable. But thanks for pointing it out — Yaroslav, Aug 01 '12 at 17:12
@APC Thanks for pointing out the problem with my data model. Agreed. As I was creating the 'popularity' column I felt this was not the way a real database / SQL programmer would do this but would instead calculate popularity on the fly each time it is needed, like you suggest, rather than store this value in the table. Thanks for the tip on Materialized Views, reading about this now... — Tim West, Aug 03 '12 at 20:53

score 4 · Accepted Answer · answered Aug 01 '12 at 14:45

4

If you want to simply update your table with the priority it would look like:

update my_table x
   set popularity = ( select count(distinct state) 
                        from my_table
                       where fruit = x.fruit )

If you want to select the data then you can use an analytic query:

select state, fruit
     , count(distinct state) over ( partition by fruit ) as popularity
  from my_table

This provides the number of distinct states, per fruit.

answered Aug 01 '12 at 14:45

Ben

51,770
36
127
149

thanks so much!!! The UPDATE solution you offered worked great on my small practice (fruit) table, however when I tried this using my actual table which contained **millions** of rows it was taking over an hour and still not finished, I'm guessing something about UPDATE for each row being quite costly? The second SELECT solution you offered turned out to work great and much much faster when I used it after CREATE TABLE new_table as(SELECT...). I'm guessing creating a new table is poor practice for SQL, but it was somehow incredibly faster on my table than the UPDATE option. T H A N K S ! – Tim West Aug 03 '12 at 20:39

Ollie · Answer 2 · 2012-08-01T14:55:11.100

I ran this and got (what I think) is what you want:

WITH t
  AS (SELECT 'hawaii' as STATE, 'apple' as fruit FROM dual
      UNION ALL
      SELECT 'hawaii' as STATE, 'apple' as fruit FROM dual
      UNION ALL
      SELECT 'hawaii' as STATE, 'banana' as fruit FROM dual
      UNION ALL
      SELECT 'hawaii' as STATE, 'kiwi' as fruit FROM dual
      UNION ALL
      SELECT 'hawaii' as STATE, 'kiwi' as fruit FROM dual
      UNION ALL
      SELECT 'hawaii' as STATE, 'mango' as fruit FROM dual
      UNION ALL
      SELECT 'florida' as STATE, 'apple' as fruit FROM dual
      UNION ALL
      SELECT 'florida' as STATE, 'apple' as fruit FROM dual
      UNION ALL
      SELECT 'florida' as STATE, 'apple' as fruit FROM dual
      UNION ALL
      SELECT 'florida' as STATE, 'orange' as fruit FROM dual
      UNION ALL
      SELECT 'michigan' as STATE, 'apple' as fruit FROM dual
      UNION ALL
      SELECT 'michigan' as STATE, 'apple' as fruit FROM dual
      UNION ALL
      SELECT 'michigan' as STATE, 'apricot' as fruit FROM dual
      UNION ALL
      SELECT 'michigan' as STATE, 'orange' as fruit FROM dual
      UNION ALL
      SELECT 'michigan' as STATE, 'pear' as fruit FROM dual
      UNION ALL
      SELECT 'michigan' as STATE, 'pear' as fruit FROM dual
      UNION ALL
      SELECT 'michigan' as STATE, 'pear' as fruit FROM dual
      UNION ALL
      SELECT 'texas' as STATE, 'apple' as fruit FROM dual
      UNION ALL
      SELECT 'texas' as STATE, 'banana' as fruit FROM dual
      UNION ALL
      SELECT 'texas' as STATE, 'banana' as fruit FROM dual
      UNION ALL
      SELECT 'texas' as STATE, 'banana' as fruit FROM dual
      UNION ALL
      SELECT 'texas' as STATE, 'grape' as fruit FROM dual)
SELECT state,
       fruit,
       count(DISTINCT state) OVER (PARTITION BY fruit) AS popularity
  FROM t;

Returned

florida     apple   4
florida     apple   4
florida     apple   4
hawaii      apple   4
hawaii      apple   4
michigan    apple   4
michigan    apple   4
texas       apple   4
michigan    apricot 1
hawaii      banana  2
texas       banana  2
texas       banana  2
texas       banana  2
texas       grape   1
hawaii      kiwi    1
hawaii      kiwi    1
hawaii      mango   1
florida     orange  2
michigan    orange  2
michigan    pear    1
michigan    pear    1

Obviously, you'd only need to run:

SELECT state,
       fruit,
       count(DISTINCT state) OVER (PARTITION BY fruit) AS popularity
  FROM table_name;

Hope it helps...

score 0 · Answer 3 · answered Aug 01 '12 at 14:42

0

If your table is #fruit...

To count the different states for each fruit

select fruit, COUNT(distinct state) statecount from #fruit group by fruit

and so to update the table with these values

update #fruit
set popularity
    = statecount
from
 #fruit
    inner join 
      (select fruit, COUNT(distinct state) statecount from #fruit group by fruit) sc
        on #fruit.fruit = sc.fruit

answered Aug 01 '12 at 14:42

podiluska

50,950
7
98
104

I'm fairly sure that this is not Oracle syntax. – Ben Aug 01 '12 at 14:48

score 0 · Answer 4 · answered Aug 01 '12 at 14:43

0

This should get you most of the way there. Basically you want to get a count of distinct states that the fruit is in and then use that to join back to the original table.

update table
set count = cnt
from 
  (
    select fruit, count(distinct state) as cnt 
    from table
    group by fruit) cnts
  inner join table t
    on cnts.fruit = t.fruit

answered Aug 01 '12 at 14:43

Derek

21,828
7
53
61

I'm fairly sure that this is not Oracle syntax. – Ben Aug 01 '12 at 14:47

score 0 · Answer 5 · answered Aug 01 '12 at 14:45

0

Another option:

SELECT fruit
,      COUNT(*)
FROM
(
SELECT state
,      fruit
,      ROW_NUMBER() OVER (PARTITION BY state, fruit ORDER BY NULL) rn
FROM   t
)
WHERE rn = 1
GROUP BY fruit
ORDER BY fruit;

answered Aug 01 '12 at 14:45

Tebbe

1,372
9
12

This only returns the column fruit. Then only one randomly selected row. – Ben Aug 01 '12 at 14:48
So turn it into an update? `UPDATE t SET popularity = (SELECT x.qty FROM (SELECT fruit , COUNT(*) qty FROM (SELECT state, fruit, ROW_NUMBER() OVER (PARTITION BY state, fruit ORDER BY NULL) rn FROM t) WHERE rn = 1 GROUP BY fruit) x WHERE x.fruit = t.fruit);` This version worked for me. – Tebbe Aug 01 '12 at 14:53

score 0 · Answer 6 · answered Aug 01 '12 at 14:49

0

Try this:

select a.*,b.total
from [table] as a
left join 
(
SELECT fruit,count(distinct [state]) as total
  FROM [table]
  group by fruit
) as b
on a.fruit = b.fruit

Note this is SQL Server code, do your own tweaks if necessary.

answered Aug 01 '12 at 14:49

xbb

2,073
1
19
34

score 0 · Answer 7 · answered Aug 01 '12 at 15:08

try this

create table states([state] varchar(10),fruit varchar(10),popularity int)
INSERT INTO states([state],fruit) 
VALUES('hawaii','apple'),
('hawaii','apple'),     
('hawaii','banana'),       
('hawaii','kiwi'),      
('hawaii','kiwi'),      
('hawaii','mango'),        
('florida','apple'),      
('florida','apple'),        
('florida','apple'),        
('florida','orange'),      
('michigan','apple'),     
('michigan','apple'),     
('michigan','apricot'),   
('michigan','orange'),    
('michigan','pear'),      
('michigan','pear'),      
('michigan','pear'),      
('texas','apple'),     
('texas','banana'),    
('texas','banana'),    
('texas','banana'),
('texas','grape')

update t set t.popularity=a.cnt
from states t inner join
(SELECT fruit,count(distinct [state]) as cnt
  FROM states
  group by fruit) a
on t.fruit =a.fruit

SQL calculate item frequency using multiple / dependent columns?

7 Answers7

Linked