
I have a query that pulls from a table. With this table, I would like to build a query that allows me to make projections into the future.

SELECT
    b.date,
    a.id,
    SUM(CASE WHEN a.date = b.date THEN a.sales ELSE 0 END) AS sales,
    SUM(CASE WHEN a.date = b.date THEN a.revenue ELSE 0 END) AS revenue
FROM table_a a
CROSS JOIN table_b b
WHERE a.date BETWEEN '2018-10-31' AND '2018-11-04'
GROUP BY 1, 2

table_b is a table with literally only one column that contains dates going deep into the future. This returns results like this:

+----------+--------+-------+---------+
|   date   |   id   | sales | revenue |
+----------+--------+-------+---------+
| 11/4/18  | 113972 |     0 |       0 |
| 11/4/18  | 111218 |     0 |       0 |
| 11/3/18  | 111218 |     0 |       0 |
| 11/3/18  | 113972 |     0 |       0 |
| 11/2/18  | 111218 |     0 |       0 |
| 11/2/18  | 113972 |     0 |       0 |
| 11/1/18  | 111218 |    89 | 2405.77 |
| 11/1/18  | 113972 |   265 | 3000.39 |
| 10/31/18 | 111218 |    64 | 2957.71 |
| 10/31/18 | 113972 |   120 | 5650.91 |
+----------+--------+-------+---------+

Now there's more to the query after this, where I get into the projections and so on, but for the purposes of this question this is all you need, as it's where the CROSS JOIN exists.

How can I recreate these results without using a CROSS JOIN? In reality this query covers a much larger date range with far more data, takes hours and a great deal of compute to run, and I know CROSS JOINs should be avoided if possible.

  • Inner Join on a.date = b.date, then eliminate the CASE statements and leave everything else the same. However, this may eliminate some rows from the output where there are no matching date values, with 0 sums. But if you really need all combinations of a.ID and b.date, then by definition you must have a CROSS JOIN, since producing all combinations from both tables is a Cartesian product and is not produced by other types of joins. – C Perkins Nov 02 '18 at 03:56
  • Hi, when writing optimal Redshift queries a lot depends on your schema and how the data is distributed and sorted. You might get a more useful answer if you can share the DDL of the tables involved and maybe some indication of the volume of data and the size of your cluster. – Nathan Griffiths Nov 02 '18 at 04:02
  • @CPerkins What about this - I create a table that is nothing but a `CROSS JOIN` of date and id. That way I'll have a table ready that already handles the `CROSS JOIN`, and then I can go from there - could that work? –  Nov 02 '18 at 04:25
  • @gooponyagrinch What do you mean by a table here? Your logic and the information given aren't clear. Building a table from the data you want is just the same as running the `cross` join here. – dwir182 Nov 02 '18 at 06:06
  • Amazon Redshift is not the same as PostgreSQL. Which one are you really using? –  Nov 02 '18 at 06:50
  • What do you mean, "without the cross join"? If you need every id-date pair (in range) then you need the cross join of the tables. INNER JOIN ON is just CROSS JOIN plus WHERE. `,` is cross join (with lower precedence). There might be performance issues with any query, but blindly worrying about CROSS JOIN is misconceived. [CROSS JOIN = (INNER) JOIN = comma (",")](https://stackoverflow.com/a/25957600/3404097) Learn about query implementation/optimization generally & for your DBMS. PS Please read & act on [mcve] -- including DDL and indexes. Please clarify via edits, not comments. – philipxy Nov 02 '18 at 18:24
  • @philipxy Why do you assume that his requirement is blind, especially since the OP didn't actually give a justification for the request? Different joins are not the same, neither in what they produce nor in how they are implemented. DB optimization may be able to reduce efficiency differences, but still not guaranteed to be identical. Perhaps conceptually one join can be described in terms of other joins... and you could even draw some nice Venn diagrams, but that does not trivialize crafting an efficient set of SQL statements. – C Perkins Nov 02 '18 at 19:43
  • @gooponyagrinch Definitely generating another intermediate table can impact efficiency, but there are plenty of other factors to consider. The most critical is how often you'd have to rerun the cross join to update the table to ensure accurate, timely data. If the date and id pairs do not change often, then try it. But how will you detect or know when they do require an updated cross join? It can end up being as expensive to ensure that as just re-running the original query. – C Perkins Nov 02 '18 at 19:48
  • @CPerkins I don't understand your point. *Because* "the OP didn't actually give a justification for the request" *except that "cross join" appears*, they are acting blindly. They are worrying about "cross join" because they see the words without regard to the context. I did not trivialize crafting code; I specifically addressed it. (Similarly, "Different joins are not the same, neither in what they produce nor in how they are implemented" & "still not guaranteed to be identical" are odd because the only guarantee is a result that *is* "conceptually" "described in terms of other joins".) – philipxy Nov 02 '18 at 20:04
  • @philipxy The OP did indeed explicitly point out the cross join. What the OP did not do was explain why, so we can't assume he has a blind angst of the words "cross join". We don't know what tests, traces, benchmarking, comparisons, etc. on which the OP is basing the request. I guess I also don't understand your point, because real, practical implementations do not necessarily match conceptual understanding. Why don't we ask? gooponyagrinch, would you mind elaborating on why the cross join should be avoided? – C Perkins Nov 02 '18 at 20:12
  • @CPerkins I explained why I think it is quite likely that the asker *is* solely concerned with "cross join" appearing. Read their last two paragraphs. Anyway despite that I didn't assume, I stated that blindly reacting is bad. And again, "Why don't we ask?" is odd since my first comment *does* ask--*first thing*. I don't see what I or the asker have actually written reflected in your comments. ("real, practical implementations" *do* "necessarily match conceptual understanding"--they just involve *more*.) – philipxy Nov 02 '18 at 20:38
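The idea floated in the comments above, i.e. materializing the date × id scaffold once and then joining the facts onto it, can be sketched with SQLite via Python's `sqlite3` module. The table names and sample values here are hypothetical miniatures of the question's tables; in Redshift the scaffold would be a real table that has to be refreshed whenever the set of ids or dates changes, which is exactly the caveat C Perkins raises.

```python
import sqlite3

# Hypothetical miniature versions of the question's tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_a (date TEXT, id INTEGER, sales INTEGER, revenue REAL);
CREATE TABLE table_b (date TEXT);

INSERT INTO table_a VALUES
    ('2018-10-31', 111218, 64, 2957.71),
    ('2018-11-01', 111218, 89, 2405.77);
INSERT INTO table_b VALUES
    ('2018-10-31'), ('2018-11-01'), ('2018-11-02');

-- Build the date x id scaffold ONCE; later reporting queries reuse it
-- instead of re-running the CROSS JOIN each time.
CREATE TABLE scaffold AS
SELECT b.date, i.id
FROM table_b b
CROSS JOIN (SELECT DISTINCT id FROM table_a) i;
""")

# The reporting query is now a cheap LEFT JOIN onto the scaffold.
rows = conn.execute("""
    SELECT s.date, s.id,
           COALESCE(a.sales, 0)   AS sales,
           COALESCE(a.revenue, 0) AS revenue
    FROM scaffold s
    LEFT JOIN table_a a ON a.date = s.date AND a.id = s.id
    ORDER BY s.date, s.id
""").fetchall()

for r in rows:
    print(r)
```

The scaffold still costs one Cartesian product to build, but it is paid once rather than on every run; the open question, per the comments, is how you detect when it needs rebuilding.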

1 Answer


Use the table of all dates as the "from" table and LEFT JOIN the data onto it; this still returns every date.

SELECT
      d.date
    , t.id
    , COALESCE(SUM(t.sales),0) sales
    , COALESCE(SUM(t.revenue),0) revenue
FROM all_dates d
LEFT JOIN table_data t
    ON d.date = t.date
WHERE d.date BETWEEN '2018-10-31' AND '2018-11-04'
GROUP BY
      d.date
    , t.id

Another alternative (to avoid the cross join) could be to use `generate_series`, but for this in Redshift I suggest this former answer. I'm a fan of `generate_series`, but if you already have a table of dates I would probably stay with that (though this is based on what little I know about your query etc.).
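To see what the LEFT JOIN approach returns, here is a minimal sketch using SQLite via Python's `sqlite3` module, with hypothetical miniature versions of `all_dates` and `table_data` seeded from the question's sample output. One difference from the CROSS JOIN version is worth noting: a date with no matching data comes back as a single row with a NULL id, rather than one zero row per id.

```python
import sqlite3

# Hypothetical miniature versions of the answer's tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE all_dates (date TEXT);
CREATE TABLE table_data (date TEXT, id INTEGER, sales INTEGER, revenue REAL);

INSERT INTO all_dates VALUES
    ('2018-10-31'), ('2018-11-01'), ('2018-11-02'),
    ('2018-11-03'), ('2018-11-04');

INSERT INTO table_data VALUES
    ('2018-10-31', 111218, 64, 2957.71),
    ('2018-10-31', 113972, 120, 5650.91),
    ('2018-11-01', 111218, 89, 2405.77),
    ('2018-11-01', 113972, 265, 3000.39);
""")

# The answer's query, verbatim apart from an ORDER BY for readability.
rows = conn.execute("""
    SELECT d.date,
           t.id,
           COALESCE(SUM(t.sales), 0)   AS sales,
           COALESCE(SUM(t.revenue), 0) AS revenue
    FROM all_dates d
    LEFT JOIN table_data t ON d.date = t.date
    WHERE d.date BETWEEN '2018-10-31' AND '2018-11-04'
    GROUP BY d.date, t.id
    ORDER BY d.date, t.id
""").fetchall()

for r in rows:
    print(r)
```

The dates with data yield one row per id, while 2018-11-02 through 2018-11-04 each yield one `(date, None, 0, 0)` row, so a downstream step would still need to fan those out per id if the projection logic requires it.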

Paul Maxwell
  • By the way, regarding *"this query is a much larger date range with way more data and takes hours"*: you should review the explain plan and especially the indexing of the tables involved. This might help in future questions: https://explain.depesz.com/ – Paul Maxwell Nov 02 '18 at 23:42