9

Problem:

Say there is a simple (yet big) table foods

id   name 
--   -----------  
01   ginger beer
02   white wine
03   red wine
04   ginger wine

I'd like to count how many entries contain specific hardcoded patterns, say the words 'ginger' (LIKE '%ginger%') or 'wine' (LIKE '%wine%'), or whatever else, and write these counts into rows alongside comments. The result I'm looking for is the following

comment           total 
---------------   -----  
contains ginger   2
for wine lovers   3

Solution 1 (good format but inefficient):

It is possible to use UNION ALL and construct the following

SELECT * FROM
(
  (
    SELECT
      'contains ginger' AS comment,
      sum((name LIKE '%ginger%')::INT) AS total
    FROM foods
  )
  UNION ALL
  (
    SELECT
      'for wine lovers' AS comment,
      sum((name LIKE '%wine%')::INT) AS total
    FROM foods
  )
)

Apparently it works much like executing multiple separate queries and stitching the results together afterwards. It is very inefficient.

Solution 2 (efficient but bad format):

The following is several times faster than the previous solution

SELECT
  sum((name LIKE '%ginger%')::INT) AS contains_ginger,
  sum((name LIKE '%wine%')::INT) AS for_wine_lovers
FROM foods

And the result is

contains_ginger   for_wine_lovers 
---------------   ---------------  
2                 3

So it is definitely possible to get the same information much faster, but in the wrong format...

Discussion:

What is the best overall approach? What should I do to get the result I want efficiently and in the preferred format? Or is it really impossible?

By the way, I am writing this for Redshift (based on PostgreSQL).

Thanks.

Pranasas
  • 597
  • 6
  • 22

12 Answers

2

Both queries use the LIKE operator. Alternatively, we can use POSITION to find the location of the hardcoded words in the name; if a word is present in the name, a number greater than 0 is returned.

SELECT
       unnest(array['ginger', 'wine']) AS comments,
       unnest(array[ginger, wine]) AS count
FROM (
     SELECT sum(contains_ginger) ginger, sum(contains_wine) wine
       FROM (
             SELECT CASE WHEN position('ginger' in name) > 0
                         THEN 1
                         END contains_ginger,
                    CASE WHEN position('wine' in name) > 0
                         THEN 1
                         END contains_wine
               FROM foods
            ) t
     ) t1
Valli
  • 1,440
  • 1
  • 8
  • 13
  • what about when the name contains both? e.g. 'ginger wine' or 'apple pear' – Matt Sep 22 '17 at 18:37
  • That's right, the current SQL won't handle this. I had overlooked the desired output. – Valli Sep 23 '17 at 01:13
  • @Matt I tried using unnest to get the desired output. I referred to https://stackoverflow.com/questions/1128737/unpivot-and-postgresql to get the idea about unnest – Valli Sep 23 '17 at 03:47
1

Try this for size:

Declare @searchTerms table (term varchar(100), comment varchar(100))
insert into @searchTerms values
('ginger','contains ginger')
,('wine','for wine lovers')
-- Add any others here

select t.comment, isnull(count(f.id),0) [total]
from @searchTerms t
left join foods f on (f.name like '%'+t.term+'%')
group by t.comment
order by 1

I'm not sure what the temp table syntax is for PostgreSQL - this example is for MS SQL Server, but I'm sure you get the idea.

UPDATE: According to the online converter at SQLines, the syntax is effectively the same.
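
I haven't tested this on Redshift, but a rough PostgreSQL/Redshift-style translation of the same idea might look something like the sketch below (the search_terms table name is just illustrative; counting f.id instead of * makes unmatched terms report 0):

CREATE TEMPORARY TABLE search_terms (term varchar(100), comment varchar(100));

INSERT INTO search_terms VALUES
  ('ginger', 'contains ginger'),
  ('wine',   'for wine lovers');
  -- add any others here

SELECT t.comment, count(f.id) AS total
FROM search_terms t
LEFT JOIN foods f ON (f.name LIKE '%' || t.term || '%')
GROUP BY t.comment
ORDER BY 1;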

Pete
  • 1,807
  • 1
  • 16
  • 32
  • this is the right idea, but why not adapt it to be specific to PostgreSQL? There are some great test sites out there; http://rextester.com/l/postgresql_online_compiler is one. Also, the search-term table probably needs two parts: one for the search term ('ginger', 'wine') and one for the description that will be used ('contains ginger', 'for wine lovers') – Matt Sep 22 '17 at 18:41
  • @Matt, because I'm a MS SQL developer, not a Postgres developer. – Pete Sep 22 '17 at 19:03
  • It's almost the same in regards to your answer, and the question is tagged postgresql, not sql-server. Give it a try; I too am a SQL Server developer but found it fun playing around with the other RDBMSs. Same concepts, and sometimes there are shortcuts on the other platforms that Microsoft didn't implement. – Matt Sep 22 '17 at 19:05
1

option 1: manually reshape

CREATE TEMPORARY TABLE wide AS (
  SELECT
    sum((name LIKE '%ginger%')::INT) AS contains_ginger,
    sum((name LIKE '%wine%')::INT) AS for_wine_lovers
    ...
  FROM foods
);

SELECT
  'contains ginger', contains_ginger FROM wide

UNION ALL
SELECT
  'for wine lovers', for_wine_lovers FROM wide

UNION ALL
...;

option 2: create a categories table & use a join

-- not sure if redshift supports values, hence I'm using the union all to build the table
WITH categories (category_label, food_part) AS (
    SELECT 'contains ginger', 'ginger'
    union all
    SELECT 'for wine lovers', 'wine'
    ...
)
SELECT
  categories.category_label,
  COUNT(foods.id) AS total  -- count a foods column (not *) so categories with no matches report 0
FROM categories
LEFT JOIN foods ON foods.name LIKE ('%' || categories.food_part || '%')
GROUP BY 1

Since you consider your Solution 2 to be fast enough, option 1 should work for you.

Option 2 should also be fairly efficient, and it is much easier to write and extend; as an added bonus, it will tell you when no foods exist in a given category.

Option 3: Reshape & redistribute your data to better match the grouping keys.

You could also pre-process your dataset if query execution time is very important. A lot of the benefits of this depend on your data volume and data distribution, and on whether you only have a few hard-coded categories or whether they will be searched dynamically from some sort of interface.

For example:

If the dataset were reshaped like this:

content   id
--------  --
ginger    01
ginger    04
beer      01
white     02
wine      02
wine      04
wine      03

Then you could shard & distribute on content, and each instance could execute that part of the aggregation in parallel.

Here an equivalent query might look like this:

WITH content_count AS (
  SELECT content, COUNT(*) total
  FROM reshaped_food_table 
  GROUP BY 1
)
SELECT
    CASE content 
      WHEN 'ginger' THEN 'contains ginger'
      WHEN 'wine' THEN 'for wine lovers'
      ELSE 'other' 
    END category
  , total
FROM content_count
Haleemur Ali
  • 26,718
  • 5
  • 61
  • 85
  • I like option 2, it can be improved by using VALUES() instead of unions and selects. – bobflux Sep 22 '17 at 16:30
  • I couldn't remember if Redshift supported the `values` clause – Haleemur Ali Sep 22 '17 at 16:33
  • Really only option 2 should be considered here. Option 1's union alls probably won't be as performant, though I won't claim that with 100% certainty; it is definitely more work to code. Option 3 lacks the ability for a food to be in 2 categories, e.g. ('ginger wine', 'apple pear'). If you wanted to go the option 3 route, it could be more efficient to have a cross-reference table between foods and categories and count on the cross-reference table itself. – Matt Sep 22 '17 at 18:57
  • @Matt, it's impossible for us to say what would be "sufficiently" fast for the OP, that's why I suggested 3 different ways. Hopefully the OP is able to test and pick the best solution for the problem. IMO, option 2 is the most elegant of the 3 options I have suggested. Option 1 should be sufficiently fast because the `union all` is picking from a relation with 1 row, which is low overhead, and the OP claims his/her Solution 2 is fast enough. Option 3 would work for the example the OP provided, and it can be extended for more complicated predicates. – Haleemur Ali Sep 22 '17 at 19:31
  • @peufeu: I don't think `VALUES` is supported in Redshift. – Erwin Brandstetter Sep 27 '17 at 03:46
  • You provided the best all around overview. Just please fix the typo in option 1, and `CREATE TEMP TABLE wide AS` works much faster than `WITH wide AS` (apparently a weakness in the Redshift CTE implementation). – Pranasas Sep 29 '17 at 06:35
  • @Pranasas could you tell us which version works best for you, and I can update my answer for future readers. – Haleemur Ali Sep 29 '17 at 14:16
  • @HaleemurAli I ended up using what is closest to option 2. Thank you. – Pranasas Oct 02 '17 at 11:06
1

I don't know about Redshift, but in Postgres I'd start with something like this:

WITH foods (id, name) AS (VALUES 
  (1, 'ginger beer'), (2, 'white wine'), (3, 'red wine'), (4, 'ginger wine'))
SELECT hardcoded.comment, count(*)
FROM (VALUES ('%ginger%', 'contains ginger'), ('%wine%', 'for wine lovers'))
  AS hardcoded (pattern, comment)
JOIN foods ON foods.name LIKE hardcoded.pattern
GROUP BY hardcoded.comment;

┌─────────────────┬───────┐
│     comment     │ count │
├─────────────────┼───────┤
│ contains ginger │     2 │
│ for wine lovers │     3 │
└─────────────────┴───────┘
(2 rows)

and if it's OK then I'd go on to create appropriate indexes on foods.name. That might include indexes on name and reverse(name); or perhaps (name gist_trgm_ops), but I don't expect Redshift to provide pg_trgm.
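
For reference, in plain Postgres the index DDL might look something like the sketch below (index names are illustrative; this is not applicable to Redshift, which has no CREATE INDEX, and the trigram index requires the pg_trgm extension):

-- plain Postgres sketch; Redshift does not support CREATE INDEX
CREATE INDEX foods_name_idx     ON foods (name);
CREATE INDEX foods_name_rev_idx ON foods (reverse(name));

CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX foods_name_trgm_idx ON foods USING gist (name gist_trgm_ops);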

pstef
  • 91
  • 7
1

Redshift is rather limited in comparison to modern Postgres.
No unnest(), no array_agg(), no ARRAY constructor, no VALUES expression, no LATERAL joins, no tablefunc module. All the tools that would make this nice and simple. At least we have CTEs ...

This should work and be fast and relatively simple to expand:

WITH ct AS (
   SELECT a.arr
        , count(name ~ arr[1] OR NULL) AS ct1
        , count(name ~ arr[2] OR NULL) AS ct2
        , count(name ~ arr[3] OR NULL) AS ct3
     -- , ... more
   FROM   foods
   CROSS  JOIN (SELECT '{ginger, wine, bread}'::text[]) AS a(arr)
   )
SELECT arr[1] AS comment, ct1 AS total FROM ct
UNION ALL SELECT arr[2], ct2 FROM ct
UNION ALL SELECT arr[3], ct3 FROM ct
--  ... more

I use the POSIX operator ~ to replace LIKE, just because it's shorter and there is no need for the added % placeholders. Performance is about the same for this simple form in Postgres; not sure about Redshift.

count(boolean_expression OR NULL) should be a bit faster than sum(boolean_expression::int).
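
A minimal illustration of why the trick works: TRUE OR NULL evaluates to TRUE, FALSE OR NULL evaluates to NULL, and count() skips NULLs, so only matching rows get counted.

SELECT count(name ~ 'ginger' OR NULL) AS contains_ginger  -- only TRUE rows stay non-NULL, so only they are counted
     , sum((name ~ 'ginger')::int)    AS same_result       -- equivalent total via summing 1/0
FROM   foods;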

Indexes won't be able to improve performance of this single sequential scan over the whole table.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • Some benchmarking from my mock setup. The first part alone (counts) takes ~20s while the whole query takes ~40s. If counts are saved into a temporary table, the second part (unions) takes ~5s initially and <1s afterwards. – Pranasas Sep 28 '17 at 08:14
  • @Pranasas: 1) How many rows in `foods` and how many array items did you test? 2) How does it compare to your initial queries? 3) 20s for the 2nd part is odd, because that should cost very little. Maybe a weakness in the Redshift CTE implementation? Anyway, if you create a temp table, it might help to run `ANALYZE tmp_tbl;` between the 2 steps (at least in Postgres, see: https://dba.stackexchange.com/a/18694/3684). – Erwin Brandstetter Sep 28 '17 at 13:19
  • The whole query is on par with my initial query, so a weakness in the Redshift CTE implementation. Also, `count(boolean)` is not actually valid because it counts everything except `NULL` in Redshift. – Pranasas Sep 29 '17 at 06:23
  • Sorry for avoiding the question about rows and sizes. `foods` is a simplified representation of my problem. In reality, it is a derived table calculated when needed. I believe these details are not necessary for the problem, though. – Pranasas Sep 29 '17 at 06:29
  • @Pranasas: 1) `count()` counts everything but `NULL`, that's standard SQL, and that is the point exactly. Consider the expression `name ~ arr[1] OR NULL` and see https://stackoverflow.com/a/26968969/939860 or https://stackoverflow.com/a/15650897/939860. 2) Rows, sizes, table definitions are practically *always* relevant. `derived table calculated when needed` is relevant, too. It might help to run `ANALYZE` after creating the derived table. – Erwin Brandstetter Sep 29 '17 at 13:20
0

A little bit of searching suggests that you could use your second approach for efficiency, and place the result into a CTE, which you then unnest(), as per: unpivot and PostgreSQL
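
A minimal sketch of that idea, assuming unnest() is available (it is in Postgres; another answer notes that Redshift may not provide it):

WITH wide AS (
  SELECT
    sum((name LIKE '%ginger%')::INT) AS contains_ginger,
    sum((name LIKE '%wine%')::INT)   AS for_wine_lovers
  FROM foods
)
SELECT
  unnest(array['contains ginger', 'for wine lovers']) AS comment,
  unnest(array[contains_ginger, for_wine_lovers])     AS total
FROM wide;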

Peter Abolins
  • 1,520
  • 1
  • 11
  • 18
0

Try this -

SELECT 'contains ginger' AS comment
      , Count(*) AS total
FROM foods
WHERE name LIKE '%ginger%'
UNION ALL
SELECT 'for wine lovers'
      , count(*)
FROM foods
WHERE name LIKE '%wine%'
Ankit Bajpai
  • 13,128
  • 4
  • 25
  • 40
0

From your sample it seems like your product names contain up to 2 words. It's more efficient to break the name on spaces and check whether the individual chunks match than to use LIKE, and then manually reshape as described in the other response

WITH counts as (
    SELECT 
      sum(('ginger' in (split_part(name,' ',1),split_part(name,' ',2)))::INT) AS contains_ginger,
      sum(('wine' in (split_part(name,' ',1),split_part(name,' ',2)))::INT) AS for_wine_lovers
    FROM foods
)
-- manually reshape
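-- (for completeness, my sketch of that reshape step, following the UNION ALL pattern from the other answers)
SELECT 'contains ginger' AS comment, contains_ginger AS total FROM counts
UNION ALL
SELECT 'for wine lovers', for_wine_lovers FROM counts;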
AlexYes
  • 4,088
  • 2
  • 15
  • 23
0

Have you considered using cursors?

Here is an example I wrote for SQL Server.

You just need some table holding all the values you want to search for in the foods table (in the example below I called it SearchWordTable, with a column named SearchWord).

CREATE TABLE #TemporaryTable 
(
    KeyWord nvarchar(50),
    ResultCount int
);

DECLARE @searchWord nvarchar(50)
DECLARE @count INT

DECLARE statistic_cursor CURSOR FOR   
SELECT SearchWord
FROM SearchWordTable

OPEN statistic_cursor  
FETCH NEXT FROM statistic_cursor INTO @searchWord  

WHILE @@FETCH_STATUS = 0  
BEGIN  
    SELECT @count = COUNT(1) FROM foods
    WHERE name LIKE '%'+@searchWord+'%'

    INSERT INTO #TemporaryTable (KeyWord, ResultCount) VALUES (@searchWord, @count)

    FETCH NEXT FROM statistic_cursor INTO @searchWord
END

CLOSE statistic_cursor
DEALLOCATE statistic_cursor

SELECT * FROM #TemporaryTable

DROP TABLE #TemporaryTable
Stefan Taseski
  • 242
  • 2
  • 24
  • this would definitely not give the best performance; a loop-based/cursor-based solution will be significantly slower than a set-based operation – Matt Sep 22 '17 at 18:37
0

I think the best option is to split the ingredients list into parts and then to count them.

"Pass0".."Pass4" and "Numbers" is just a Tally table to get a list of numbers 1..256 to emulate the unnest.

"comments" is a a simple table you should have somewhere with ingredients and their comments

use your table "foods" instead of mine ;)

let's take a look

with
Pass0 as (select '1' as C union all select '1'), --2 rows
Pass1 as (select '1' as C from Pass0 as A, Pass0 as B),--4 rows
Pass2 as (select '1' as C from Pass1 as A, Pass1 as B),--16 rows
Pass3 as (select '1' as C from Pass2 as A, Pass2 as B),--256 rows
numbers as (
    select ROW_NUMBER() OVER(ORDER BY C) AS N FROM Pass3
),    
comments as (
    select 'ginger' ingredient, 'contains ginger' comment union all 
    select 'wine', 'for wine lovers' union all 
    select 'ale', 'a warm kind of beer' union all 
    select 'beer', 'your old friend'
),
foods as (
    select 01 id, 'ginger beer' name union all 
    select 02   ,'white wine' union all 
    select 03   ,'red wine' union all 
    select 04   ,'ginger wine' union all 
    select 05   ,'ginger ale' union all 
    select 06   ,'pale ale' union all 
    select 07   ,'ginger beer'
),
ingredients as (
    select ingredient, COUNT(*) n
    from foods d
    CROSS JOIN LATERAL(
        select SPLIT_PART(d.name, ' ', n.n) ingredient
        from numbers n
        where SPLIT_PART(d.name, ' ', n.n)<>''
    ) ingredients
    group by ingredient
)
select i.*, coalesce(c.comment, 'no comment..') comment
from ingredients i
left join comments c on c.ingredient = i.ingredient

ingredient  n   comment
ale         2   a warm kind of beer
beer        2   your old friend
ginger      4   contains ginger
pale        1   no comment..
red         1   no comment..
white       1   no comment..
wine        3   for wine lovers
MtwStark
  • 3,866
  • 1
  • 18
  • 32
0

Here you go.

The WHERE filter reduces the rows going into the GROUP BY aggregation. It's not necessary for smaller data but will help if the table is in the billions of rows. Add additional patterns to the REGEXP filter and the CASE statement.

SELECT CASE WHEN name LIKE '%ginger%' THEN 'contains ginger' 
            WHEN name LIKE '%wine%'   THEN 'for wine lovers'
       ELSE NULL END "comment"
      ,COUNT(*) total
FROM foods
WHERE REGEXP_INSTR(name,'ginger|wine') > 0
GROUP BY 1
;
Joe Harris
  • 13,671
  • 4
  • 47
  • 54
0

Try with SQL like this:

SELECT count(1) as total,'contains ginger' result
FROM foods where name LIKE '%ginger%'
union all
SELECT count(1),'for wine lovers' 
FROM foods where name LIKE '%wine%'
I Love You
  • 268
  • 2
  • 8